General rules

Benchmarks

When writing benchmarks to measure the performance of compilers and computers it is very easy to be fooled; some examples follow below.

An optimizing compiler will remove redundant expressions from loops. Suppose you would like to measure the time of floating point operations or elementary functions.

t = fsecond()
do j = 1, n
  s = s + x * y
end do
t1 = fsecond() - t

t = fsecond()
do j = 1, n
  si = sin(x)
end do
t2 = fsecond() - t

In the loops above x * y and sin(x) do not depend on j, so the compiler may compute each of them only once instead of n times. It is necessary to change the arguments in every iteration to get accurate measurements.

Constant expressions will be evaluated once.
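A minimal sketch of a loop in which the argument changes in every iteration, so that sin cannot be moved out of the loop. The program name, the step 1.0e-7 and the use of the standard intrinsic system_clock (instead of fsecond, which is not a standard routine) are assumptions made for illustration only.

```fortran
program vary_arguments
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 10000000
  integer :: j, c0, c1, rate
  real(dp) :: x, s

  call system_clock(c0, rate)      ! start count and counts per second
  s = 0.0_dp
  do j = 1, n
    x = 1.0e-7_dp * j              ! the argument changes in every iteration
    s = s + sin(x)                 ! accumulate, so every value is needed
  end do
  call system_clock(c1)            ! stop count

  print *, 'Time = ', real(c1 - c0, dp) / rate, ' s'
  print *, 's    = ', s            ! use the result after the loop
end program vary_arguments
```

Note that the measured time now also includes one multiplication, one addition and one integer-to-real conversion per iteration, so it is an upper bound on the cost of sin rather than an exact figure.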

Consider the following (now very old; but it fooled me once) Fortran77 program which computes a partial sum of the harmonic series (1/1 + 1/2 + ... + 1/1000000000):

 program fooled_me
 double precision s, fsecond, t
 integer j

 t = fsecond()
 s = 0.0d0
 do j = 1, 1000000000
   s = s + 1.0d0 / j
 end do
 t = fsecond() - t
 print*, 'Time = ', t

 end

On an old 140 MHz Sun, compiled with f77 -fast, this takes 2 microseconds. That is very impressive: 10^9 iterations, each with one addition and one division, in 2 microseconds would give roughly 1 Pflops (10^15; P is for Peta, which comes after Mega, Giga and Tera). That is not bad for a computer with a theoretical top speed of 6.4 million divisions per second (one division takes 22 cycles). Obviously something is very wrong.

If one studies the assembly output (compile using f77 -S -fast) it can be seen that the whole loop has been removed. This is quite reasonable from the compiler's point of view. We never make use of the variable s after the loop, so why compute its value? If we do use s, by adding the line print*, s after the loop, we get a completely different time. In fact it takes 159 seconds, which is what we expect (1000000000 divisions at 6.4e6 per second gives 156 seconds).

So, print out (part of) what you compute. If you compute a vector, print the first and last elements, for example.
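The advice above could look as follows; this is only a sketch. The program name, the smaller loop count and the standard intrinsic system_clock (used here in place of fsecond) are assumptions, not part of the original program.

```fortran
program use_results
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 1000000
  integer :: j, c0, c1, rate
  real(dp) :: s
  real(dp), allocatable :: z(:)

  allocate(z(n))

  call system_clock(c0, rate)
  s = 0.0_dp
  do j = 1, n
    z(j) = 1.0_dp / j              ! a vector result
    s = s + z(j)                   ! a scalar result (partial harmonic sum)
  end do
  call system_clock(c1)

  print *, 'Time = ', real(c1 - c0, dp) / rate, ' s'
  print *, 's = ', s                      ! the scalar is used ...
  print *, 'z(1), z(n) = ', z(1), z(n)    ! ... and so is the vector
end program use_results
```

Since both s and z are printed after the timing, the compiler cannot discard the loop as dead code.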

Now consider the following code which adds two vectors:

 subroutine add_vectors( x, y, z, n )
 implicit none
 integer n, j
 double precision x(n), y(n), z(n)

 do j = 1, n
   z(j) = x(j) + y(j)
 end do

 end

To get accurate times we repeat the loop 1000 times. In the first loop below we do not call the routine; we just use the loop itself. In the second loop we call the routine, which is placed in the same file as the loops. In the third loop we call an add-routine (identical to the one above but with a different name) that lies in a separate file, i.e. we compile the main program and the add-routine separately.

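With n = 100000, the first two loops take about 0.02 s each and the third takes 12 s. The first two times are too short to be true: when the compiler can see the body of the loop (directly, or by inlining a routine from the same file) it can tell that all 1000 repetitions compute the same z, and it skips the repeated work. It cannot do that for a separately compiled routine, so only the third time measures 1000 vector additions.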

 t = fsecond()
 do k = 1, 1000 ! to get accurate times
   do j = 1, n
     z(j) = x(j) + y(j)
   end do
 end do
 t = fsecond() - t

 t = fsecond()
 do k = 1, 1000 ! to get accurate times
   call add_vectors( x, y, z, n )
 end do
 t = fsecond() - t

 t = fsecond()
 do k = 1, 1000 ! to get accurate times
   call add_vectors_ext( x, y, z, n )
 end do
 t = fsecond() - t

Place routines which are to be tested in a separate file and not in the main program.
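A sketch of such a layout is given below. The file names, the program name and the use of the standard intrinsic system_clock are assumptions. The listing compiles as one file too, but to follow the rule it should be split at the marked comment and the two files compiled separately. Whether a given compiler then really leaves the call alone is also an assumption; compilers with link-time (interprocedural) optimization may still look across files.

```fortran
! ---- file 1 (hypothetical name: bench_main.f90) ----
program bench_main
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 100000, n_rep = 1000
  integer :: k, c0, c1, rate
  real(dp), allocatable :: x(:), y(:), z(:)
  external add_vectors_ext

  allocate(x(n), y(n), z(n))
  x = 1.0_dp
  y = 2.0_dp

  call system_clock(c0, rate)
  do k = 1, n_rep                  ! repeat to get a measurable time
    call add_vectors_ext(x, y, z, n)
  end do
  call system_clock(c1)

  print *, 'Time per call = ', real(c1 - c0, dp) / rate / n_rep, ' s'
  print *, 'z(1), z(n)    = ', z(1), z(n)   ! use the result
end program bench_main

! ---- file 2 (hypothetical name: add_ext.f90) ----
subroutine add_vectors_ext(x, y, z, n)
  implicit none
  integer n, j
  double precision x(n), y(n), z(n)

  do j = 1, n
    z(j) = x(j) + y(j)
  end do
end subroutine add_vectors_ext
```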

If you are sharing a computer with other users, or if you are running other programs in parallel with your benchmark program, your timings will not be very accurate.

If the thing you are testing takes very little time, your timings will not be very accurate. You usually have to use a loop to repeat the computation many times. But don't let the compiler fool you. If your loops are too simple-minded, some parts may be rearranged or deleted.

There are several technologies (e.g. SpeedStep, PowerNow!, Cool'n'Quiet) for dynamically changing the clock frequency of the CPU (to save energy, to reduce the rpm of the cooling fan etc). This leads to problems when we want to do benchmarks, since the CPU may not run at full speed. Here is a small example in Matlab, but the same type of problem occurs in other programming languages. The tests were performed on the student machines.

Here are some details: /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies contains the possible frequencies (the scaling factors, given in kHz) for cpu0. The student machines have two cores with hyper-threading, giving rise to four logical cpus, named cpu0-cpu3. On our machines the values are 3193000, 3192000, 3059000, 2926000, 2793000, 2660000, 2527000, 2394000, 2261000 and 1197000, implying that the possible clock frequencies range from approximately 3.2 GHz down to 1.2 GHz. The current value is found in scaling_cur_freq. The Linux system may move the process between the cores, so one must inspect the four files /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq through /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq.

The program below performs a number of iterations (n_iter). In each iteration the current scaling is saved, a matrix multiplication is performed, and the current scaling is saved again. At the end of the program a table is printed.

function test(n, n_iter)

A = rand(n);   % create some data
B = rand(n);
T = zeros(n_iter, 2);

for k = 1:n_iter
  % check current scaling factor
  T(k, 1) = find_max_freq * 1e-6;   % save before
  C = A * B;                        % do some computing
  T(k, 2) = find_max_freq * 1e-6;   % save after
end

% print a table
disp(' GHz GHz')
fprintf('%4.1f %4.1f\n', T')

function freq = find_max_freq

load /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
f = scaling_cur_freq;

load /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq
f(2) = scaling_cur_freq;

load /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq
f(3) = scaling_cur_freq;

load /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq
f(4) = scaling_cur_freq;

freq = max(f);

Here is a test run in Matlab (version R2009b):

>> test(50, 15)

 GHz  GHz
 1.2  1.2
 1.2  1.2
 1.2  1.2
 1.2  1.2
 1.2  1.2
 3.2  3.2
 3.2  3.2
 3.2  3.2
 3.2  3.2
 3.2  3.2
 3.2  3.2
 3.2  3.2
 3.2  3.2
 3.2  3.2
 3.2  3.2

Note that it takes time (iterations) for the frequency to increase. Very small matrices (short execution time) may never give an increased frequency.
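The same check can be sketched in Fortran. The sketch below assumes four logical cpus (cpu0-cpu3, as on the student machines) and values given in kHz; the program name, the function name max_freq_ghz and the matrix size are made up for illustration. If a file cannot be opened or read, that cpu is skipped, and 0.0 is reported if none can be read.

```fortran
program freq_check
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 200, n_iter = 15
  integer :: k
  real(dp) :: a(n, n), b(n, n), c(n, n), before, after

  call random_number(a)            ! create some data
  call random_number(b)

  print *, '  GHz   GHz'
  do k = 1, n_iter
    before = max_freq_ghz()        ! save before
    c = matmul(a, b)               ! do some computing
    after = max_freq_ghz()         ! save after
    print '(f5.1, 1x, f5.1)', before, after
  end do
  print *, 'c(1,1), c(n,n) = ', c(1, 1), c(n, n)   ! use the result

contains

  ! Largest current frequency (in GHz) over cpu0-cpu3.
  function max_freq_ghz() result(ghz)
    real(dp) :: ghz
    character(len=80) :: path
    integer :: cpu, u, ios, khz

    ghz = 0.0_dp
    do cpu = 0, 3
      write(path, '(a, i0, a)') '/sys/devices/system/cpu/cpu', cpu, &
                                '/cpufreq/scaling_cur_freq'
      open(newunit=u, file=trim(path), status='old', action='read', &
           iostat=ios)
      if (ios /= 0) cycle          ! file missing: skip this cpu
      read(u, *, iostat=ios) khz
      close(u)
      if (ios == 0) ghz = max(ghz, khz * 1.0e-6_dp)
    end do
  end function max_freq_ghz
end program freq_check
```

One possible use is to run the computation untimed until the table shows the highest frequency, and only then start the actual timing.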

The student Linux system has a command, /usr/sbin/cpuspeed, that can be used to change the frequency, but the command requires root (superuser) privileges (corresponds to administrator privileges in the Windows world). If you have your own Linux machine you can try the command. Read http://carlthompson.net/Software/CPUSpeed for more details.