Lapack

Lapack

In this exercise you are going to compare the textbook Cholesky factorization routine with Lapack implementations of the same algorithm. The Cholesky factorization is the first step when solving Ax=b where A is a symmetric and positive definite matrix.

The LU-factorization of a square matrix A is the product A = LU where L is lower triangular (zeros above the diagonal) and U is upper triangular. When A is symmetric and positive definite it is possible to write A = CC^T where C is lower triangular and where T denotes transpose.

One way to write the Cholesky factorization algorithm (using Matlab notation) is shown below. The lower triangle of the original matrix A is replaced by the Cholesky factor C (it is common to replace things like this in order to save space).

n = length(A);
for k=1:n
  A(k, k) = sqrt(A(k, k) - sum(A(k, 1:k-1).^2));
  for i = k+1:n
    A(i, k) = (A(i, k) - sum(A(i, 1:k-1) .* A(k, 1:k-1))) / A(k, k);
  end
end

For more details, see (almost) any standard textbook in numerical analysis. The factorization is available in Matlab using the command chol.

Implement the algorithm above (you may replace the sums by loops) and run it on matrices of order 50, 100, 150, ..., 1200 in Fortran or C. How many floating point operations (additions and multiplications) are used (just look at the algorithm above and count them, and express the number in terms of the dimension n)? How many Gflop/s to you get (make a plot using Matlab)?

Do the same thing using the Lapack routine dpotrf from Lapack in the following three ways:

Fetch the necessary Lapack-routines (Fortran-versions) from Netlib (look under Browse). Use the (slow) unoptimized BLAS in this first test. Note that it is easy to fetch just the necessary routines (you do not have to fetch the whole Lapack-library and all of BLAS).

Comment 2008-04-29: it is slightly harder this year (and following years) since "Include (unoptimized) BLAS routines?" does not seem to work any longer. Is used to work however. For that reason I have put some of the hard-to-find routines on the student system. Look under /chalmers/groups/thomas_math/HPC/Lapack . You must still fetch dpotrf (to practice using Netlib).

Comment 2012-03-27: there is, however, BLAS.

Note: this part will be used in the next assignment as well.
To see how important it is with fast BLAS, you should test an excellent BLAS-library GotoBLAS2. Fetch and unpack the package yourself (it is part of the assignment). Then read 01Readme.txt and proceed (I used gfortran for the Fortran source). Build a single threaded library to make the comparisons fair (but you can test multi-threaded versions too if you want). Note that GotoBLAS2 contains the necessary Lapack-routines as well. Here is an article about the person, Kazushige Goto, behind the routines.

Update 2013-04-22 Note that GotoBLAS2 does not have support for the latest types of CPU:s (e.g. AMD Bulldozer, Intel Sandy Bridge). An alternative for Sandy Bridge, is to use the OpenBLAS package which is based on GotoBLAS2 (thanks to Damiano O. for the link). OpenBLAS does not support the Bulldozer, so you may want to use AMD's ACML-package. An alternative to OpenBLAS is Intel's MKL (which you can try even if you try OpenBLAS or GotoBLAS).

Update 2011-04-12 and 2012-03-29: When you compile on the student machine, you need to specify a target architecture, otherwise the named constant RPREFETCHSIZE will be undefined in kernel/x86_64/gemm_ncopy_4.S. So, when you give the make-command, add TARGET=NEHALEM (in addition to your other make-options). If you have problems with using ifort, change the Fortran-compiler adding FORTRAN=gfortran and F_COMPILER=gfortran when you run make; make sure you do not have ifort in your Linux-path. You want to use BINARY=64 as well.
There is a Lapack-library, /usr/lib64/liblapack.so that comes with Linux distribution. (It is sufficient to link with -llapack, you do not have to supply a link-path.) What do you think of the performance? (Note that the performance depends on what Linux-version you are using.)
Optional (not compulsory). Test Intel's MKL and/or AMD's ACML. These libraries contain both LAPACK and BLAS among other things.

Make some nice plots where it is easy to compare all the tests (so do not plot one curve per page). One plots showing time vs. n for the four cases, and one showing Gflops vs. n for the four cases.

A Fortran-hint. See the pdf-file "Introduction to C, Fortran 90, FORTRAN 77, tcsh and bash" under the Diary as well. This pdf-file contains a similar discussion for C.

One can use the following theorem to construct a symmetric and positive definite matrix:

Theorem: if A is symmetric and

A(k, k) > |A(k, 1)| + |A(k, 2)| + ... + |A(k, k-1)| + |A(k, k+1)| + ... + |A(k, n)|, k = 1, 2, ..., n,

then A is positive definite.

So taking A(k, k) = 1 and small off-diagonal elements will work (do not take zeros).