General rules

Benchmarks

When writing benchmarks to measure the performance of compilers and computers it is very easy to be fooled; some examples follow below.

An optimizing compiler will remove redundant expressions from loops. Suppose you would like to measure the time of floating point operations or elementary functions.

t = fsecond()
do j = 1, n
  s = s + x * y
end do
t1 = fsecond() - t

t = fsecond()
do j = 1, n
  si = sin(x)
end do
t2 = fsecond() - t

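In the loops above the product x * y and the value sin(x) do not change between iterations, so an optimizing compiler may compute each of them only once, outside the loop. The loops then no longer time n multiplications or n sine evaluations. It is necessary to change the arguments from one iteration to the next to get accurate measurements.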
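
As a minimal sketch of what changing the arguments could look like, the sine loop can be rewritten so that x moves a little in every iteration and the values are accumulated and printed. The program name vary_args, the step dx and the value of n are my own choices for illustration, and since fsecond is not shown here the sketch assumes the standard cpu_time intrinsic (Fortran 95) is available instead.

      program vary_args
      implicit none
      integer           j, n
      parameter         ( n = 10000000 )
      double precision  s, x, dx, t0, t1

      x  = 0.0d0
      dx = 1.0d-7
      s  = 0.0d0

      call cpu_time( t0 )
      do j = 1, n
        ! x differs in every iteration, so sin(x) is not loop invariant
        s = s + sin(x)
        x = x + dx
      end do
      call cpu_time( t1 )

      ! print the sum so that the loop cannot be discarded
      print*, 'Time = ', t1 - t0
      print*, 's    = ', s

      end

Note that the measured time now also contains two additions per iteration (the update of s and of x), so it is an upper bound on the cost of the sine calls rather than an exact figure.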

Constant expressions will be evaluated once.

Consider the following Fortran 77 program (now very old, but it fooled me once), which computes a partial sum of the harmonic series (1/1 + 1/2 + ... + 1/1000000000):

      program fooled_me
      double precision  s, fsecond, t
      integer           j

      t = fsecond()
      s = 0.0d0
      do j = 1, 1000000000
        s = s + 1.0d0 / j
      end do
      t = fsecond() - t
      print*, 'Time = ', t

      end

On an old 140 MHz Sun, compiled with f77 -fast, this takes 2 microseconds. That is very impressive, since it corresponds to roughly 1 Pflops (10^15 floating point operations per second; P is for Peta, which comes after Mega, Giga and Tera). Not bad for a computer with a theoretical top speed of 6.4 million divisions per second (one division takes 22 cycles). Obviously something is very wrong.

If one studies the assembly output from the compiler (compile using f77 -S -fast) it can be seen that the whole loop has been removed. This is quite reasonable from the compiler's point of view. We never make use of the variable s after the loop, so why compute its value? If we do use s, by adding the line print*, s after the loop, we get a completely different time. In fact it takes 159 seconds, which is what we expect (1000000000 divisions at 6.4e6 per second gives 156 seconds).

So, print out (part of) what you compute. If you compute a vector, print the first and last elements, for example.
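
For the harmonic series program, a version that follows this rule might look like the sketch below. The program name not_fooled is made up, and cpu_time (a standard Fortran 95 intrinsic) stands in for fsecond, which is not listed here.

      program not_fooled
      implicit none
      double precision  s, t0, t1
      integer           j

      call cpu_time( t0 )
      s = 0.0d0
      do j = 1, 1000000000
        s = s + 1.0d0 / j
      end do
      call cpu_time( t1 )

      print*, 'Time = ', t1 - t0
      ! using s here forces the compiler to keep the loop
      print*, 's    = ', s

      end

The printed sum should be about 21.3 (roughly log(10^9) + 0.5772), which also serves as a quick check that the loop really ran.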

Now consider the following code which adds two vectors:

      subroutine add_vectors( x, y, z, n )
      implicit none
      integer           n, j
      double precision  x(n), y(n), z(n)

      do j = 1, n
        z(j) = x(j) + y(j)
      end do

      end

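To get accurate times we repeat the loop 1000 times. In the first loop below we do not call the routine; we just write out the loop. In the second loop we call the routine, which is placed in the same file as the loops. In the third loop we call an add routine (identical to the one above but with a different name) that lies in a separate file, i.e. we compile the main program and the add routine separately.

With n = 100000, the first two loops take about 0.02 s each and the third takes 12 s. The difference is not the cost of 1000 subroutine calls. When the compiler can see the loop body (in the second case, presumably after inlining the routine) it can tell that every repetition produces the same z and need not carry them all out. It cannot do this for a routine that is compiled separately.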

      t = fsecond()
      do k = 1, 1000  ! to get accurate times
        do j = 1, n
          z(j) = x(j) + y(j)
        end do
      end do
      t = fsecond() - t

      t = fsecond()
      do k = 1, 1000  ! to get accurate times
        call add_vectors( x, y, z, n )
      end do
      t = fsecond() - t

      t = fsecond()
      do k = 1, 1000  ! to get accurate times
        call add_vectors_ext( x, y, z, n )
      end do
      t = fsecond() - t

Place routines which are to be tested in a separate file and not in the main program.
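
A complete test program along these lines could be organised as in the following sketch. The program name time_add, the file name add_ext.f and the initial values of x and y are assumptions for illustration, and cpu_time is used in place of fsecond. The two program units are listed together here only so that the example is complete; the point is that the subroutine is kept in its own file and compiled separately.

      program time_add
      implicit none
      integer           n, j, k
      parameter         ( n = 100000 )
      double precision  x(n), y(n), z(n), t0, t1

      do j = 1, n
        x(j) = j
        y(j) = 2.0d0 * j
      end do

      call cpu_time( t0 )
      do k = 1, 1000  ! to get accurate times
        call add_vectors_ext( x, y, z, n )
      end do
      call cpu_time( t1 )

      print*, 'Time = ', t1 - t0
      ! print part of the result so the work cannot be discarded
      print*, 'z(1), z(n) = ', z(1), z(n)

      end

      ! Keep this routine in its own file (for example add_ext.f,
      ! a hypothetical name) and compile it separately.
      subroutine add_vectors_ext( x, y, z, n )
      implicit none
      integer           n, j
      double precision  x(n), y(n), z(n)

      do j = 1, n
        z(j) = x(j) + y(j)
      end do

      end

This relies on the compiler not optimizing across files. Some modern compilers can do so when link-time or interprocedural optimization is switched on, so it is worth checking that such options are off when timing this way.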

If you are sharing a computer with other users, or if you are running other programs in parallel with your benchmark program, your timings will not be very accurate.

If the thing you are testing takes very little time, your timings will not be very accurate. You usually have to use a loop to repeat the computation many times. But don't let the compiler fool you. If your loops are too simple-minded, some parts may be rearranged or deleted.
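
One way to choose the number of repetitions, sketched below, is to double it until the total time is comfortably larger than the resolution of the timer, and then report the time per call. This is only one possible approach; the one second threshold and the program name repeat_timing are arbitrary choices of mine, cpu_time replaces fsecond, and add_vectors_ext is assumed to be compiled in a separate file, as in the vector example above (it is repeated here only to make the listing complete).

      program repeat_timing
      implicit none
      integer           n, nrep, k
      parameter         ( n = 100000 )
      double precision  x(n), y(n), z(n), t0, t1, t

      do k = 1, n
        x(k) = k
        y(k) = 1.0d0
      end do

      nrep = 1
      t    = 0.0d0
      ! double the repetition count until the run takes at least 1 s
      do while ( t .lt. 1.0d0 )
        nrep = 2 * nrep
        call cpu_time( t0 )
        do k = 1, nrep
          call add_vectors_ext( x, y, z, n )
        end do
        call cpu_time( t1 )
        t = t1 - t0
      end do

      print*, 'Repetitions   = ', nrep
      print*, 'Time per call = ', t / nrep
      print*, 'z(1), z(n)    = ', z(1), z(n)

      end

      ! Keep this routine in a separate file and compile it separately.
      subroutine add_vectors_ext( x, y, z, n )
      implicit none
      integer           n, j
      double precision  x(n), y(n), z(n)

      do j = 1, n
        z(j) = x(j) + y(j)
      end do

      end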