Benchmarks
When writing benchmarks to measure the performance of compilers and computers, it is very easy to be fooled. Some examples follow.
An optimizing compiler will remove redundant expressions from loops. Suppose you would like to measure the time of floating-point operations or elementary functions:
      t = fsecond()
      do j = 1, n
         s = s + x * y
      end do
      t1 = fsecond() - t

      t = fsecond()
      do j = 1, n
         si = sin(x)
      end do
      t2 = fsecond() - t
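In the loops above the expressions x * y and sin(x) do not change between iterations, so s and si will be computed once. It is necessary to change the arguments from one iteration to the next to get accurate measurements.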
Constant expressions will be evaluated once.
Consider the following (now very old, but it fooled me once) Fortran 77 program, which computes a partial sum of the harmonic series (1/1 + 1/2 + ... + 1/1000000000):
      program fooled_me
      double precision s, fsecond, t
      integer j

      t = fsecond()
      s = 0.0d0
      do j = 1, 1000000000
         s = s + 1.0d0 / j
      end do
      t = fsecond() - t
      print*, 'Time = ', t
      end
On an old 140 MHz Sun, compiling with f77 -fast, this takes 2 microseconds. That is very impressive, since it would give roughly 1 Pflops (10^15; P is for peta, which comes after mega, giga and tera), which is not bad for a computer with a theoretical top speed of 6.4 million divisions per second (one division takes 22 cycles). Obviously something is very wrong.
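The arithmetic behind these figures can be checked in a few lines. The sketch below is in Python purely for convenience. The inputs are the numbers quoted in the text; counting two floating-point operations per iteration (one addition, one division) is an assumption made here to arrive at the 1 Pflops figure.

```python
# Sanity check of the quoted numbers (inputs taken from the text).
clock_hz = 140e6            # 140 MHz clock
cycles_per_division = 22    # one division takes 22 cycles
n = 1_000_000_000           # loop iterations
measured_s = 2e-6           # reported run time: 2 microseconds

# Peak division rate the hardware could sustain.
peak_div_per_s = clock_hz / cycles_per_division

# Assumption: 2 floating-point operations per iteration (add + divide).
flops_per_iteration = 2
apparent_flops = flops_per_iteration * n / measured_s

# Time the loop should need if every division is really executed,
# using the rounded rate of 6.4 million divisions per second.
expected_s = n / 6.4e6

print(f"peak division rate : {peak_div_per_s:.3g} per second")
print(f"apparent speed     : {apparent_flops:.3g} flops")
print(f"expected run time  : {expected_s:.0f} s")
```

The apparent speed exceeds the peak division rate by many orders of magnitude, which is the signal that the loop was never executed.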
If one studies the assembler output (compile using f77 -S -fast) it can be seen that the whole loop has been removed. This is quite reasonable from the compiler's point of view. We never make use of the variable s after the loop, so why compute its value? If we do use s, by adding the line print*, s after the loop, we will get a completely different time. In fact it takes 159 seconds, which is what we expect (1000000000 divisions at 6.4e6 per second gives 156 seconds).
So, print out (part of) what you compute. If you compute a vector, print the first and last elements, for example.
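A minimal sketch of the same habit in a timing harness, written in Python for brevity. The CPython interpreter does not delete the loop the way f77 -fast did, so this only illustrates the structure: the result is used after the timed region. The function name harmonic and the reduced value of n are choices made for this illustration.

```python
import time

def harmonic(n):
    """Partial sum 1/1 + 1/2 + ... + 1/n."""
    s = 0.0
    for j in range(1, n + 1):
        s += 1.0 / j
    return s

n = 1_000_000   # much smaller than in the text, to keep the run short

t0 = time.perf_counter()
s = harmonic(n)
elapsed = time.perf_counter() - t0

# Printing s means the computed value is actually used after the loop.
print(f"Time = {elapsed:.3f} s, s = {s:.12f}")
```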
Now consider the following code which adds two vectors:
      subroutine add_vectors( x, y, z, n )
      implicit none
      integer n, j
      double precision x(n), y(n), z(n)

      do j = 1, n
         z(j) = x(j) + y(j)
      end do
      end
To get accurate times we repeat the loop 1000 times. In the first loop below we do not call the routine, we just use the loop. In the second loop we call the routine, and the routine is placed in the same file as the loops. In the third loop we call an add-routine (identical to the one above but with a different name) that lies in a separate file, i.e. we compile the main program and the add-routine separately.
With n = 100000, the first two loops take about 0.02 s each and the third takes 12 s. The first two times are too good to be true: when the compiler can see the loop body it can tell that the repetitions are redundant, and presumably removes them.
      t = fsecond()
      do k = 1, 1000 ! to get accurate times
         do j = 1, n
            z(j) = x(j) + y(j)
         end do
      end do
      t = fsecond() - t

      t = fsecond()
      do k = 1, 1000 ! to get accurate times
         call add_vectors( x, y, z, n )
      end do
      t = fsecond() - t

      t = fsecond()
      do k = 1, 1000 ! to get accurate times
         call add_vectors_ext( x, y, z, n )
      end do
      t = fsecond() - t
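A quick check of what the reported times imply, sketched in Python. The times and sizes come from the text; the helper name additions_per_second is made up for the illustration.

```python
n = 100_000              # vector length
repeats = 1000           # outer repetitions
additions = n * repeats  # total additions if nothing is removed

def additions_per_second(seconds):
    """Rate implied by a measured time, assuming all additions were executed."""
    return additions / seconds

for label, seconds in [("loop only", 0.02),
                       ("routine in same file", 0.02),
                       ("routine in separate file", 12.0)]:
    rate = additions_per_second(seconds)
    print(f"{label:26s} {seconds:6.2f} s  {rate:.2e} additions/s")
```

About 5e9 additions per second would be far beyond a machine like the 140 MHz Sun mentioned earlier (assuming, which the text does not state, that a comparable machine was used), whereas roughly 8e6 per second is plausible. That suggests the repetitions were not actually executed in the first two cases.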
Place routines which are to be tested in a separate file and not in the main program.
If you are sharing a computer with other users, or if you are running other programs in parallel with your benchmark program, your timings will not be very accurate.
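One common way to reduce the influence of other activity on the machine, not described in the text but widely used, is to repeat the whole measurement several times and report the smallest time, since disturbances can only add time. A sketch in Python using the standard timeit module; work is a hypothetical stand-in for the code being tested.

```python
import timeit

def work():
    # Hypothetical stand-in for the code under test.
    return sum(1.0 / j for j in range(1, 10_001))

calls = 20   # calls per measurement

# Five independent measurements, each timing `calls` calls of work().
times = timeit.repeat(work, repeat=5, number=calls)
per_call = [t / calls for t in times]

print(f"best  = {min(per_call):.3e} s per call")
print(f"worst = {max(per_call):.3e} s per call")
```

A large gap between the best and the worst time is a hint that the machine was busy with something else during the run.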
If the thing you are testing takes very little time, your timings will not be very accurate. You usually have to use a loop to repeat the computation many times. But don't let the compiler fool you. If your loops are too simple-minded, some parts may be rearranged or deleted.
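The repetition count can be chosen automatically by doubling it until the total time is comfortably above the timer resolution. The sketch below, in Python, does that and also accumulates the results so that every call contributes to a value that is printed afterwards. The names time_per_call, min_total and tiny are hypothetical.

```python
import time

def time_per_call(func, min_total=0.1):
    """Double the repetition count until one timed batch of calls
    takes at least min_total seconds; return the average time per call."""
    reps = 1
    while True:
        checksum = 0.0
        t0 = time.perf_counter()
        for _ in range(reps):
            checksum += func()   # use every result
        elapsed = time.perf_counter() - t0
        if elapsed >= min_total:
            return elapsed / reps, reps, checksum
        reps *= 2

def tiny():
    # Hypothetical computation that is far too short to time on its own.
    return 1.0 / 3.0

per_call, reps, checksum = time_per_call(tiny)
print(f"{reps} repetitions, {per_call:.3e} s per call, checksum = {checksum:.6f}")
```

Note that the time per call includes the overhead of the loop and the function call, so for very short computations it is an upper bound rather than an exact figure.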
function test(n, n_iter)
  A = rand(n); % create some data
  B = rand(n);
  T = zeros(n_iter, 2);

  for k = 1:n_iter
    % check current scaling factor
    T(k, 1) = find_max_freq * 1e-6; % save before
    C = A * B;                      % do some computing
    T(k, 2) = find_max_freq * 1e-6; % save after
  end

  % print a table
  disp(' GHz GHz')
  fprintf('%4.1f %4.1f\n', T')

function freq = find_max_freq
  load /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq
  f = scaling_cur_freq;
  load /sys/devices/system/cpu/cpu1/cpufreq/scaling_cur_freq
  f(2) = scaling_cur_freq;
  load /sys/devices/system/cpu/cpu2/cpufreq/scaling_cur_freq
  f(3) = scaling_cur_freq;
  load /sys/devices/system/cpu/cpu3/cpufreq/scaling_cur_freq
  f(4) = scaling_cur_freq;
  freq = max(f);
Here is a test run in Matlab (version R2009b):
>> test(50, 15)
GHz GHz
1.2 1.2
1.2 1.2
1.2 1.2
1.2 1.2
1.2 1.2
3.2 3.2
3.2 3.2
3.2 3.2
3.2 3.2
3.2 3.2
3.2 3.2
3.2 3.2
3.2 3.2
3.2 3.2
3.2 3.2
Note that it takes time (iterations) for the frequency to increase. Very small matrices (short execution time) may never give an increased frequency.
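For readers without Matlab, the same sysfs files can be read with a short Python sketch. It assumes a Linux system that exposes cpufreq under the path used above and that the files contain the frequency in kHz (consistent with the factor 1e-6 in the Matlab code). The function name max_cur_freq_ghz is made up here. On systems without these files it returns None.

```python
import glob
import os

def max_cur_freq_ghz(base="/sys/devices/system/cpu"):
    """Return the highest current core frequency in GHz,
    or None if no cpufreq information can be read."""
    pattern = os.path.join(base, "cpu[0-9]*", "cpufreq", "scaling_cur_freq")
    freqs_khz = []
    for path in glob.glob(pattern):
        try:
            with open(path) as f:
                freqs_khz.append(int(f.read().strip()))
        except (OSError, ValueError):
            continue   # unreadable or malformed entry: skip this core
    if not freqs_khz:
        return None
    return max(freqs_khz) * 1e-6   # kHz -> GHz

print("before:", max_cur_freq_ghz())
total = sum(j * j for j in range(2_000_000))   # do some computing
print("after: ", max_cur_freq_ghz(), "(checksum", total, ")")
```

Unlike the Matlab version, this looks at every core it finds rather than only cpu0 to cpu3.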
The student Linux system has a command, /usr/sbin/cpuspeed, that can be used to change the frequency, but the command requires root (superuser) privileges (corresponds to administrator privileges in the Windows world). If you have your own Linux machine you can try the command. Read http://carlthompson.net/Software/CPUSpeed for more details.