Benchmarks
When writing benchmarks to measure the performance of compilers and computers it is very easy to be fooled; some examples follow below.
An optimizing compiler will remove redundant expressions from loops. Suppose you would like to measure the time of floating point operations or elementary functions.
      t = fsecond()
      do j = 1, n
        s = s + x * y
      end do
      t1 = fsecond() - t

      t = fsecond()
      do j = 1, n
        si = sin(x)
      end do
      t2 = fsecond() - t
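In the loops above x * y and sin(x) are the same in every iteration, so the compiler needs to evaluate them only once, not n times. It is necessary to change the arguments from one iteration to the next to get accurate measurements.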
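As a minimal sketch of what "changing the arguments" could look like, the program below gives sin a new argument in every iteration and accumulates the results. It is not the author's code: the program name, the step size and the value of n are made up for illustration, and the standard cpu_time intrinsic is assumed as a stand-in for fsecond.

```fortran
program vary_arguments
  implicit none
  integer, parameter :: n = 10000000   ! hypothetical repetition count
  double precision :: x, si, t0, t1
  integer :: j

  x  = 0.0d0
  si = 0.0d0

  call cpu_time(t0)            ! assumption: cpu_time replaces fsecond
  do j = 1, n
    x  = x + 1.0d-7            ! the argument changes in every iteration
    si = si + sin(x)           ! accumulate, so every value is needed
  end do
  call cpu_time(t1)

  print *, 'Time = ', t1 - t0
  print *, 'si   = ', si       ! use the result after the loop
end program vary_arguments
```

Note that the measured time now also includes one addition for updating x and one for the accumulation, so it is an upper bound on the cost of sin rather than an exact figure.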
Constant expressions will be evaluated once.
Consider the following (now very old, but it fooled me once) Fortran 77 program, which computes a partial sum of the harmonic series (1/1 + 1/2 + ... + 1/1000000000):
      program fooled_me
      double precision s, fsecond, t
      integer j

      t = fsecond()
      s = 0.0d0
      do j = 1, 1000000000
        s = s + 1.0d0 / j
      end do
      t = fsecond() - t

      print*, 'Time = ', t
      end
On an old 140 MHz Sun, compiling with f77 -fast, this takes 2 microseconds. That is very impressive, since it corresponds to roughly 1 Pflops (10^15 floating point operations per second; P is for Peta, which comes after Mega, Giga and Tera), which is not bad for a computer with a theoretical top speed of 6.4 million divisions per second (one division takes 22 cycles). Obviously something is very wrong.
If one studies the assembly output from the compiler (compile using f77 -S -fast) it can be seen that the whole loop has been removed. This is quite reasonable from the compiler's point of view: we never make use of the variable s after the loop, so why compute its value? If we do use s, by adding the line print*, s after the loop, we get a completely different time. In fact it takes 159 seconds, which is what we expect (1000000000 divisions at 6.4e6 per second gives 156 seconds).
So, print out (part of) what you compute. If you compute a vector, print the first and last elements, for example.
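The fix described above can be sketched as follows. This is a free-form variant of the program rather than the original listing: the name not_fooled is made up, and the standard cpu_time intrinsic is assumed in place of fsecond so that the sketch is self-contained.

```fortran
program not_fooled
  implicit none
  double precision :: s, t0, t1
  integer :: j

  call cpu_time(t0)            ! assumption: cpu_time replaces fsecond
  s = 0.0d0
  do j = 1, 1000000000
    s = s + 1.0d0 / j          ! partial sum of the harmonic series
  end do
  call cpu_time(t1)

  print *, 'Time = ', t1 - t0
  print *, 's    = ', s        ! s is used, so the loop cannot simply be dropped
end program not_fooled
```

Because s is printed, the compiler has to produce its value, and the reported time should reflect the cost of actually running the loop.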
Now consider the following code which adds two vectors:
      subroutine add_vectors( x, y, z, n )
      implicit none
      integer n, j
      double precision x(n), y(n), z(n)

      do j = 1, n
        z(j) = x(j) + y(j)
      end do

      end
To get accurate times we repeat the loop 1000 times. In the first loop below we do not call the routine; we just use the loop itself. In the second loop we call the routine, which is placed in the same file as the loops. In the third loop we call an add-routine (identical to the one above but with a different name) that lies in a separate file, i.e. we compile the main program and the add-routine separately.
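With n = 100000, the first two loops take about 0.02 s each and the third takes 12 s. The first two times are far too short for 1000 repetitions, which suggests that the compiler, being able to see the whole computation, has noticed that every pass of the k-loop does the same work and has not repeated it.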
      t = fsecond()
      do k = 1, 1000   ! to get accurate times
        do j = 1, n
          z(j) = x(j) + y(j)
        end do
      end do
      t = fsecond() - t

      t = fsecond()
      do k = 1, 1000   ! to get accurate times
        call add_vectors( x, y, z, n )
      end do
      t = fsecond() - t

      t = fsecond()
      do k = 1, 1000   ! to get accurate times
        call add_vectors_ext( x, y, z, n )
      end do
      t = fsecond() - t
Place routines that are to be tested in a separate file, not in the same file as the main program.
If you are sharing a computer with other users, or if you are running other programs in parallel with your benchmark program, your timings will not be very accurate.
If the thing you are testing takes very little time, your timings will not be very accurate. You usually have to use a loop to repeat the computation many times. But don't let the compiler fool you: if your loops are too simple-minded, some parts may be rearranged or deleted.
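One way to make a repetition loop a little less simple-minded is sketched below: the input is changed in every repetition and part of each result is used. The program name, the variable check and the use of the standard cpu_time intrinsic are assumptions made for illustration; this is not the author's test program, and a sufficiently clever compiler may still simplify it, so it remains worth checking that the reported time is plausible.

```fortran
program repeat_timing
  implicit none
  integer, parameter :: n = 100000, nrep = 1000
  double precision, allocatable :: x(:), y(:), z(:)
  double precision :: t0, t1, check
  integer :: j, k

  allocate(x(n), y(n), z(n))
  do j = 1, n
    x(j) = j
    y(j) = 1.0d0
  end do

  check = 0.0d0
  call cpu_time(t0)                ! assumption: cpu_time replaces fsecond
  do k = 1, nrep
    x(1) = k                       ! change the input so repetitions differ
    do j = 1, n
      z(j) = x(j) + y(j)
    end do
    check = check + z(1) + z(n)    ! use part of every repetition's result
  end do
  call cpu_time(t1)

  print *, 'Time per repetition = ', (t1 - t0) / nrep
  print *, 'check = ', check       ! print something that depends on all repetitions
end program repeat_timing
```

Dividing the total time by nrep gives the time for one vector addition. Comparing that figure with a rough estimate from the machine's peak speed, as was done for the harmonic series above, is a simple sanity check.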