More on sparse matrices

Sparse matrices (i.e. large matrices containing many zeros) are a very useful tool in applications, e.g. when solving PDEs. Special data structures are used to store such matrices. A simple strategy is to store every nonzero, a(j, k), together with its row and column indices, j and k. There are more compact data structures, however, and Matlab uses a compressed column format. Suppose we want to store the following matrix:

      1  0  0  4  0  0  8
  A = 0  2  0  0  0  7  9
      0  3  0  5  6  0  0

The elements can, of course, be any double precision numbers. In this example I have chosen the nonzeros 1 through 9 to simplify the following discussion. Matlab would store the matrix something like this (some details are not clear from the documentation).

The nonzero elements are stored, column by column, in an array, a say. An integer array, row, contains the row indices of the nonzeros. A third integer array, col, contains indices (pointers) into a and row, marking where each column starts. col contains an extra element, pointing to where the next element of a and row would have been (had there been one). I have added extra spaces to show the breaks between columns.

a   = 1   2 3      4 5   6   7   8 9
row = 1   2 3      1 3   3   2   1 2
col = 1   2    4   4     6   7   8     10
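
To see where these numbers come from, here is a minimal sketch (not Matlab's internal code) that reconstructs the three arrays for the example matrix. find returns the nonzeros in column-major order, which is exactly the order used above.

A = [1 0 0 4 0 0 8; 0 2 0 0 0 7 9; 0 3 0 5 6 0 0];

% The simple strategy: every nonzero together with its row and column index.
[row, j, a] = find(A);           % a = 1 ... 9, row = 1 2 3 1 3 3 2 1 2

% The column pointers: col(k) is where column k starts in a and row;
% the extra last element points one past the final nonzero.
nnz_per_col = sum(A ~= 0, 1);    % 1 2 0 2 1 1 2
col = cumsum([1, nnz_per_col])   % 1 2 4 4 6 7 8 10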

Here is another example. We would like to store the matrix produced by the Matlab-command
A = 5 * eye(5) + diag(1:4, 1). The three arrays are:

a   = 5 1 5 2 5 3 5 4 5
row = 1 1 2 2 3 3 4 4 5
col = 1 2 4 6 8 10

If A is an m x n matrix and the number of nonzeros is nnz, the memory requirement, in bytes, is:

memory = 8 * nnz + 4 * (nnz + n + 1)

i.e. 8 bytes for each stored double plus 4 bytes for each of the nnz row indices and the n + 1 column pointers.

So, in the example: 8 * 9 + 4 * (9 + 7 + 1) = 140 bytes. This can also be seen using the whos-command:

>> S=sparse([1 0 0 4 0 0 8; 0 2 0 0 0 7 9; 0 3 0 5 6 0 0]);
>> whos
  Name      Size          Bytes  Class
  S         3x7             140  double array (sparse)


Grand total is 9 elements using 140 bytes

In some cases a and row may have more than nnz elements, since some algorithms (e.g. LU-factorization) may have allocated extra space for reasons of efficiency (nzmax(matrix) returns the number of allocated elements). In our examples nnz(S) equals nzmax(S).
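
Here is a small sketch of the difference (my own example, not from the text above); spalloc reserves room for a given number of nonzeros up front, so nzmax should exceed nnz:

S = spalloc(3, 7, 20);   % allocate room for 20 nonzeros
S(1, 1) = 1;             % only one nonzero is actually stored
nnz(S)                   % 1, the number of stored nonzeros
nzmax(S)                 % 20, the number of allocated elements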

Note that the expression for memory contains n, the number of columns. If m is much smaller than n, one will save memory by storing the transpose of the matrix instead (if that is an option). Consider:

>> A = sprandn(2, 100000, 0.001);
>> A_transp = A';
>> nnz(A)
ans = 200
>> whos
  Name           Size          Bytes  Class
  A              2x100000     402404  double array (sparse)
  A_transp  100000x2            2412  double array (sparse)
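
The formula above explains the difference: 8 * 200 + 4 * (200 + 100000 + 1) = 402404 bytes for A, but only 8 * 200 + 4 * (200 + 2 + 1) = 2412 bytes for A_transp.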

This has some consequences when one is writing code. Here is a simple example, where we compute the sum of the elements in four different ways.

>> tic; for k = 1:1000, s = sum(sum(A, 1)); end; toc
Elapsed time is 4.638800 seconds.
>> tic; for k = 1:1000, s = sum(sum(A, 2)); end; toc
Elapsed time is 0.007592 seconds.
>> tic; for k = 1:1000, s = sum(sum(A_transp, 1)); end; toc
Elapsed time is 0.008057 seconds.
>> tic; for k = 1:1000, s = sum(sum(A_transp, 2)); end; toc
Elapsed time is 3.102995 seconds.

 
Formulate some interesting conclusions from the above example and use them to speed up the following computation.

Using sparse matrices decreases the memory requirements, but it is harder to write fast code since we usually have to access elements using indirect addressing (pointers). Here is a silly example; look at the times:

>> f = ones(10000, 1);  % create a full vector of ones
>> s = sparse(f);       % store it in sparse form as well
>> r = []; tic; for k = 1:10000, r = f .* f; end; toc
Elapsed time is 0.515900 seconds.
>> r = []; tic; for k = 1:10000, r = s .* s; end; toc
Elapsed time is 16.059336 seconds.

One would not store a full vector in sparse form, of course, but the example does illustrate the overhead caused by the data structure.


Standard algorithms for computing eigenvalues of large sparse matrices (or solving linear systems) are often based on matrix-vector products, y = A * x. As we will see later in the course, it takes quite some time to bring the elements of the matrix from the main memory to the floating point unit. So instead of computing A * x, where x is a column vector, one can compute A * X, where X contains several columns (making it possible to use the matrix elements several times). We talk about block-algorithms (e.g. block-Lanczos, or block-Arnoldi) and the number of columns in X is called the block size. As you will see in the following exercise, the time for multiplying with two columns, say, is less than doing two multiplications with one column each.
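
Here is a rough illustration (a sketch only, using tic and toc rather than etime; the numbers will vary between machines): one product with a two-column block X compared with two products using one column each.

n = 5000;
A = spdiags(randn(n, 101), -50:50, n, n);    % the same matrix as in the exercise below
x1 = randn(n, 1); x2 = randn(n, 1);
X = [x1, x2];                                % block of two columns
tic; for k = 1:20, Y = A * X;                end; t_block  = toc;
tic; for k = 1:20, y1 = A * x1; y2 = A * x2; end; t_single = toc;
t_block / t_single   % typically less than one; the matrix elements are reused within the block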

 
Create a sparse matrix by n = 5000; A = spdiags(randn(n, 101), -50:50, n, n); and measure the time it takes to perform Y = A * X, where X = randn(n, bs), bs = 1, 2, ..., 20 (bs for block size). In order to get accurate times, you have to repeat the multiplication some 20 times for each bs. Let time(bs) be the time for one Y = A * X, for block size bs.
Make two plots, one showing time(bs) / time(1) (i.e. the normalised time), the other showing (time(bs) / bs) / time(1) (i.e. the normalised time per column). What is the optimal block size (looking only at the time and not at the convergence properties of the (unknown) algorithm)?
Hint: etime.



The matrix in the previous example is a so-called band matrix, which is a special type of sparse matrix. Here is a small example. The blue, red and green spheres mark elements that are stored (their values may differ, of course). The zeros outside the band are not stored. The main diagonal consists of the red elements. The green elements, to the immediate right of the main diagonal, make up the first super-diagonal. The blue elements, to the immediate left, form the first sub-diagonal.



[Figure: a small band matrix; the coloured elements inside the band are stored, the zeros outside the band are not.]


The matrix can be symmetric, in which case it has the same number of sub- and super-diagonals. If a matrix has only one sub- and one super-diagonal we have a tridiagonal matrix, etc. The number of super-diagonals is called the upper bandwidth (two in the example), and the number of sub-diagonals is the lower bandwidth (three in the example). The total number of diagonals, six in the example, is the bandwidth. We can store a band matrix in a more compact form than we can a general sparse matrix.
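
As a small sketch (my own example, using the same bandwidths as above), spdiags can create such a band matrix directly:

n = 6;
B = spdiags(ones(n, 6), -3:2, n, n);   % three sub-diagonals, the main diagonal and two super-diagonals
full(B)                                % ones inside the band, zeros outside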

 
We would like to see how the performance depends on the bandwidth. So, let S(w) be a matrix of order n = 3000 having lower and upper bandwidths both equal to w (use spdiags to create the matrices). Compute 10 matrix-vector products for each bandwidth(*), and plot the time as a function of w. Measure, and draw in the same plot, the time for a matrix-vector product where S is stored as a full matrix of order n = 3000. For what bandwidth is the full multiply faster?
(*) It is sufficient to test the bandwidths w = 50, 100, 150, ..., until the time exceeds that for the full matrix-vector multiply.

Note that I am not trying to say that one should use full matrices instead of sparse ones. If a matrix is large enough, sparse storage is the only alternative. This last problem does show that performance may suffer, however. Later in the course, we will fix the problem by calling a faster Fortran-routine from Matlab.

Here is a link I got from David Heintz, discussing the formation of finite element matrices (large and sparse).

