There are six assignments in total, four of which are mandatory. Solutions of Assignments 4 and 5 count towards you examination. The first due date is rather late after the start to provide sufficient time to get acquainted with HPC. Do not underestimate the time it takes to solve an assignment.

Please consider the submission instructions.

Name Due Date Description
development environment Mo, September 17 Basic development tools.
optimization Mo, October 1 Performance measurement, diagnosis, and basic optimization.
threads Mo, October 15 POSIX threads.
openmp Fr, October 26 OpenMP, a compiler parallelization framework.
opencl (optional) Fr, October 26 Framework for heterogeneous computing environment (GPU, FPGA, etc.).
mpi (optional) Tu, October 30 Message passing interface for distributed computing.

All assignment texts are by Martin Raum, CC BY-SA 4.0.

Submission instructions

There are two types of assessments. I will assess directly your solutions to Assignments 2 and beyond. For them your program’s performance is the key objective. You handle Assignments 0 and 1 yourself as peer assessments. The focus of peer assessed assignments is to enforce learning of basic techniques that you will need for the remaining assignments.

Peer assessment

For peer assessment, first solve the assignments independently. Then get together in groups of three to check and possibly improve your solutions in group work. You can then come forward to me or Johannes during the computer lab sessions. After possibly asking some questions for cross-check, we will register your passing the assignment.

Performance assessment

Send your solutions to all assignments that I assess to me by mail with subject “[HPCAssignment]” on the due day at the lastest. Delayed submissions will not be accepted. After submission you will receive feedback on your assignment and possibly a list of corrections that you have to make before you pass the assignment. Make sure before submitting that your solutions passes the check scripts that I provide.

Submission groups

For Assignments 2 and beyond you are allowed and encouraged to work on and submit assignments in groups of two or three students. You have to sign up for a group tag that I will assign for the purpose of tracking your submissions. For example, hpcgp001 would be such a tag. You send email to me with subject “[HPCGroup]” with a list of all usernames on gantenbein that be part of your group. I will answer with a group tag. For instance, if you have accounts hpcuser021, hpcuser043, and hpcuser056 that wish to work together as a group, then send me a mail

hpcuser021, hpcuser043,  hpcuser056

and I would, for example, reply by


Implementations and source code

Programming language

Do not diverge from the choice of programming language that is indicated for each of the exercises. If no choice is suggested, then it is mandatory to use C.

Makefile build

You have to include a makefile, whose standard target builds all required exectuables. For example, if you have to provide exectuables main1 and main2, then the following must work


Makefile clean

The makefile must also provide a clean target that shall remove all files resulting from compilation, but no more. For example, if you provide two files, main.c and makefile, and running the exectuable main yields another file called outputfile, then

make clean

should preserve main.c, makefile, and outputfile, but delete main.o (if created) and main.

Build system

Your code must compile and run on gantenbein.


In addition to the source code for your solution, you have to write a report in (kramdoc compatible) markdown that explains your solution, with focus on the program layout and performance characteristics. You shall also answers all questions posed in context of the assignment, if there is any.


Markdown is a markup language which is geared towards easy typing and reading. If not familiar with it, you can find a summary of most important commands.

You can test whether you file compiles propertly with kramdown by running

kramdown -o html > report.html

Submission archive

Your solution to assignments should be packaged as a tar.gz-archive named by your group tag in lower case letters. For example, if your tag is hpcgp001, then your submission should contain one tar.gz-archive hpcgp001.tar.gz.


The root directory of the submission tar.gz-archive should contain the report as file called If you include any addition material (like pictures and graphs) in your report, then this should be contained in the folder reportextra. For example, if hpcgp001 is your group tag, then

tar xf hpcgp001.tar.gz
kramdown -o html > report.html

should create an html-file from your report.


The following example assignment is unrelated to any of the actual assigments. The purpose of this example is to illustrate concretely what aspects you should treat in your report.

Tasks description

Implement a program called randomtyping that

You can freely assume that the user input will yield machine size integers.

Submission archive

An implementation of this can be found in this following tar-archive.


The report, also contained in the above submission archive, addresses the program layout first and then performance. For this very small program, the program discussion is comparatively detailed, essentially discussion every single line. In your actual submission it suffices to discuss blocks of code that serve a given purpose (semantic blocks). The performance evuluation is meant humourous, but reveals the essential features of an actual discussion: Identify what you believe (or know) are the limiting factors to performance and elaborate on alternative implementations. A comparison of benchmarks as in the example is not required, but can be of great value also to your developing the program.

Development machine: Gantenbein

Back to the overview of assignments

Gantenbein is a computer at the department of mathematics. You must have access to check and submit your assignments. I will give you an account after you have registed in PingPong. Come and see me during the first lab session.

If you, for instance, have the user name hpcuser023, you log into gantenbein via


After that you have a usual terminal to work in. Gantenbein does not allow for a graphical user interface, like all high performance computing environments. Use the terminal as explained in the first lectures.

Editing files on gantenbein

There are two standard text editors installed on gantenbein: NeoVim and Emacs. For beginners I very much recommend to use Emacs. You can open a file for editing by


Navigation within a file works, for example, with key arrows. To write a file to disc press CTRL-x CTRL-s. To close a file press CTRL-x CTRL-c.

Transfering files to and from gantenbein

Should you want to edit files on a different system, you will need means to transfer them to and from gantenbein. I will not give you support on this, since the officially recommended way is to work on gantenbein directly.

Here, though, is some information for your convenience: The most confortable way is to use sshfs, which is available to you if you use your own Linux, Mac, or Windows computer. It connects a folder on gantenbein with a folder on your computer in a persistent way. To transfer individual folders or files, you can use the command scp on Linux and Mac. The user manual can be retrieved by the man command, as explained in the lectures.

If you decide to take files on your own computer, but do not have a suitable code editor, Atom is a popular one that runs on Linux, Mac, and Windows. Yet again, though, I will not provide support for running this on your system.

Assignment 0: Development Environment and basic programming

Back to the overview of assignments

The goal of this pre-assignment is to practice the basic use of the terminal environment and make. This is subject to peer-assessment.

Loging in and setting up a terminal environment

Make sure to revisit from the lectures’s slides how to:

Flint and Arb

Arb is a library for arbitrary precision ball arithmetic. It builds up on Flint, a library for fast number theoretic comuputation. The next goal is to provide local installations of both on gantenbein. You will need to install Flint first and then Arb.

Acquiring the code

Many programs that one uses in an HPC environment are not provided by the system. You have to find them yourself, often on the web. Make sure to:

Using Git

The following is optional for those, who wish to train the use of git.

Installing Flint

To build and install Flint, find the readme file in the Flint source folder and follow the instructions. When configuring the build make sure that:

Installing Arb

Adjust the steps that you follows for installing Flint to Arb. In particular, you will have to find a way how to inform the Arb configuration about where you have installed Flint. Make sure that:

Testing your installation

Flint comes with a small set of tests. Find out how to build and run them.

Arb comes with a small set of examples.

Basic C programming

The basic layout of C programs is similar to other programming languages, Java for example. A reference can be found at It is important to be familiar with the syntax of if, for, and while statements, which are merely a syntactic variation of what can be found in other programming languages. The next goal is to get familiar with the basic pattern of stack allocation, heap allocation, and file access.

Stack allocation

Allocation describes the reserving of memory for a program. Stack allocation is the fastest variant. While the C standard makes no assumption on what memory is allocated on the stack, it is reasonable to assume that arrays of static size are allocated on the stack.

Write a program employing the following allocation pattern

#define SIZE 10

int as[SIZE];
for ( size_t ix = 0; ix < SIZE; ++ix )
  as[ix] = 0;

printf("%d\n", as[0]);

Make sure that you place the declaration of the array as the main function (as opposed to global scope).

Compile and run the program with optimization level 1 on gcc. Then increase the size of the arrary, recompile and run, until the program fails with a segmentation fault. Explain this behavior assuming that that allocation was performed on the stack.

Heap allocation

Heap allocation is the alternative to stack allocation. It is generally slower, but more flexible. In particular, it has less restrictions on the size of allocated memory.

Write a program with the following allocation pattern

#define SIZE 10

int * as = (int*) malloc(sizeof(int) * SIZE);
for ( size_t ix = 0; ix < SIZE; ++ix )
  as[ix] = 0;

printf("%d\n", as[0]);


Compile and run as above, and verify that the program does not fail for sizes that triggered a segmentation fault before.

Reducing memory fragmentation

When allocating memory on the heap, there is no guarantee about where it is positioned. A common strategy to avoid this is to allocate memory in large blocks (contiguous memory). Implement a program based on the next two code snippets. Explain the meaning of all variables, and exhibit the possible layout of allocated memory in both cases.

Not avoiding memory fragmentation:

#define SIZE 10

int ** as = (int**) malloc(sizeof(int*) * SIZE);
for ( size_t ix = 0; ix < SIZE; ++ix )
  as[ix] = (int*) malloc(sizeof(int) * SIZE);

for ( size_t ix = 0; ix < SIZE; ++ix )
  for ( size_t jx = 0; jx < SIZE; ++jx )
    as[ix][jx] = 0;

printf("%d\n", as[0][0]);

for ( size_t ix = 0; ix < SIZE; ++ix )

Avoiding memory fragmentation:

#define SIZE 10

int * asentries = (int*) malloc(sizeof(int) * SIZE*SIZE);
int ** as = (int**) malloc(sizeof(int*) * SIZE);
for ( size_t ix = 0, jx = 0; ix < SIZE; ++ix, jx+=SIZE )
  as[ix] = asentries + jx;

for ( size_t ix = 0; ix < SIZE; ++ix )
  for ( size_t jx = 0; jx < SIZE; ++jx )
    as[ix][jx] = 0;

printf("%d\n", as[0][0]);


Writing to files

The C file model is based on FILE pointers, which have to be acquired and released. Check on the web the syntax and meaning of the following functions:

Implement a program that

How does your choice of memory allocation (contiguous vs. noncontiguous) impact the possibilties to write the matrix?

Parsing command line arguments

Programs can be called with command line arguments. For example, a program printargs might be called as

./printargs -a2 -b4

These arguments are passed to the main function

    int argc,
    char *argv[]

as an array of null-terminated strings argv (argument vector), the number of which is argc (argument count). In the above example the argument count is 3 and the argument vector accordingly contains 3 strings “./printargs”, “-a2”, and “-b4”.

Make sure that you are familiar with the concept of null-terminated strings and stdout and check on the web the syntax and meaning of the following functions and expressions:

Implement a program that

For example calling

printargs -a2 -b4


printargs -b4 -a2

should output

A is 2 and B is 4

As a final remark observe that standard programs would equally accept “-a2” and “-a 2”. When using standard solutions like POSIX getopt, this is automatically taken care of.

Learning summary

This assignment is a preparation towards your working on Assignments 2 through 5. In particular, you will need the following to facilitate your course studies:

Assignment 1: Optimization

Back to the overview of assignments

The goal of the first assignment is to practice how to time and profile code, and to inspect some aspects that contribute to performance. This is an assortment of various, independent tasks that are subject to peer assessment. It is allowed and suggested to complete each task, and then consult with your peer assessment group for feedback before starting the next task. Only after completing all tasks, however, you need to register your passing with me or Johannes.


In this task you familiarize yourself with a naive approach to timing and benchmarking. Familiarize yourself with

Implement a program that

Compile and run the program with all possible optimization levels (-O0 -O1 -O2 -O3 -Os -Og). Also generate assembler code for each of the optimization levels. Discuss the timing results in light of that assembler code.


In this task you will see the effect of automatic inlining. You have to compile programs with optimization level 2 (i.e. compiler flag -O2) in order to produce theeffect.

Implement a function with signature

mul_cpx( double * a_re, double * a_im, double * b_re, double * b_im, double * c_re, double * c_im)

that computes the product of two complex numbers b and c, and stores it in a. The suffices re and im stand for the real and imaginary part of a,b,c respectively.


Write a C program, called locality, that

Use the following naive implementations of row and column sums:

  double * sums,
  const double ** matrix,
  size_t nrs,
  size_t ncs
  for ( size_t ix=0; ix < nrs; ++ix ) {
    double sum = 0;
    for ( size_t jx=0; jx < ncs; ++jx )
      sum += matrix[ix][jx];
    sums[ix] = sum;

  double * sums,
  const double ** matrix,
  size_t nrs,
  size_t ncs
  for ( size_t jx=0; jx < ncs; ++jx ) {
    double sum = 0;
    for ( size_t ix=0; ix < nrs; ++ix )
      sum += matrix[ix][jx];
    sums[jx] = sum;

With an eye to memory access patterns

To conclude

Indirect addressing

In this task you get familiar with the impact of indirect addressing. It is frequently used for sparse matrix and vector operators.

Write a program that

Concretely, implement, as a program base on the following code:

n = 1000000; m = 1000;
for (kx=0; kx < n; ++kx) {
  jx = p[kx];
  y[jx] += a * x[jx];

Benchmark two variants of the program for some fix vectors x and y (any kind of entries will do), and some nonzero scalar a, where p is generated by either

ix = 0;
for (jx=0; jx < m; ++jx)
  for (kx=0; kx < m; ++kx)
    p[jx + m*kx] = ix++;


for (ix=0; ix < n; ++ix)
  p[ix] = ix;

Provide and benchmark an alternative implementation that accesses x and y directly, avoiding indirect addressing.

Benchmark with optimization level 0 and 2 in all cases. Compare the timings, and explain the difference.

Writing to HDD and to SSD

In this task you familiarize yourself with the performance difference between HDD (Hard Disc Drive) and SSD (Solid State Drive).

Home directories on gantenbein are mapped to a RAID 5 configuraton of HDDs. In addition there is an SSD available at /run/mount/scratch, which is available to all users. When accessing /run/mount/scratch, create a folder hpcuserXXX with XXX your hpc user number on gantenbein.

Next, in the terminal,

Discuss all timing results and speculate about the general efficient use of HDD and SSD.


In this task you familiarize yourself with some basic use of valgrind.

Implement a program with the following code:

int * as = (int*)malloc(10*sizeof(int));
int sum = 0;

for ( int ix = 0; ix < 10; ++ix )
  as[ix] = ix;

for ( int ix = 0; ix < 10; ++ix )
  sum += as[ix];


Compile the program with optimization level 0 and with the flag -g, then run the program with valgrind memcheck.

Repeat the compilation and running with valgrind after making one of the following modifications:

Discuss the respective valgrind output.


In this task you familiarize yourself with a very basic use of the debugger GDB.

Implement a program based on the following code

int * as = NULL;
int sum = 0;

for ( int ix = 0; ix < 10; ++ix )
  as[ix] = ix;

for ( int ix = 0; ix < 10; ++ix )
  sum += as[ix];


Compile the program (here, refered to as invalid_access) with optimization level 0 and with the flag -g. Run the following code in the shell:

gdb invalid_access

This starts the debugger gdb. Follow the next steps:

Note that when you need to pass arguments to your program you can invoke GDB as follows:

gdb --args PROGRAM ARGS

Learning Summary

Like Assignment 0, this assignment is a preparation towards your working on Assignments 2 through 5. In particular, you will need the following to facilitate your course studies:

Assignment 2: Threads

Back to the overview of assignments

We will use Newton’s method to practice programming with POSIX threads. Give a function f, say f(x) = x^3 - 1, where x^3 denotes the third power of x, Newton’s method iteratively strives to find a root of f. Starting with a value x_0, one sets x_{k+1} = x_k - f(x_k) / f’(x_k). Convergence of this is unclear, and even if it does converge, it is unclear what root of f is approached.

Here is an example run of Newton’s method for three different (complex) values of x_0.

  real value complex value complex conjucate of previous one
x_0 4.00000000000000 -3.00000000000000 + 1.00000000000000i -3.00000000000000 - 1.00000000000000i
x_1 2.68750000000000 -1.97333333333333 + 0.686666666666667i -1.97333333333333 - 0.686666666666667i
x_2 1.83781773931855 -1.25569408339907 + 0.505177535283118i -1.25569408339907 - 0.505177535283118i
x_3 1.32390198935733 -0.705870398114610 + 0.462793270024375i -0.705870398114610 - 0.462793270024375i
x_4 1.07278230395299 -0.284016534261040 + 0.737606186335893i -0.284016534261040 - 0.737606186335893i
x_5 1.00482620484690 -0.585120941952759 + 0.849582155034797i -0.585120941952759 - 0.849582155034797i
x_6 1.00002314326798 -0.501764864834321 + 0.859038339628915i -0.501764864834321 - 0.859038339628915i
x_7 1.00000000053559 -0.499955299939949 + 0.866052541624328i -0.499955299939949 - 0.866052541624328i

One can experimentally find which points on the complex plane converge to which root. Below there are two pictures: In the left one every pixel is colored in such a way that its color corresponds to the root which it converges to. In the right hand side, every pixel is colored in such a way that its color corresponds to the number of iterations which are necessary to get close to a root of f. Note that this computation can be carried out for each pixel in parallel. There is no interdependence among the iterations.

You can find more images in the the folder


on gantenbein.


Implement in C using POSIX threads a program called newton that computes similar pictures for a given functions f(x) = x^d - 1 for complex numbers with real and imaginary part between -2 and +2. The program should write the results as PPM files (see the description of PPM below) to the current working directory. The files should be named newton_attractors_xD.ppm and newton_convergence_xD.ppm, where D is replaced by the exponent of x that is used.

Your program should accept command line arguments

./newton -t5 -l1000 7
./newton -l1000 -t5 7

The last argument is d, the exponent of x in x^d - 1. The argument to -l gives the number of rows and columns for the output picture. The argument to -t gives the number of threads used to compute the pictures. For example, the above command computes images of size 1000x1000 for the polynomial x^7 - 1. It uses 5 threads for the Newton iteration.

Recall from the last part of assignment 0 that command line arguments can, in general, be passed in arbitrary order. The degree is a positional argument, i.e. it will be the last argument whenever the program newton is invoked.

You can find an example program at

/home/hpc2018/a2_example/newton -t2 -l1000 5

Implement details: Iteration

You have to run the iteration for each point until either of these criteria is met.

Implement details: Writing to file

PPM files

PPM (Portable PixMap) files are an ASCII based image format. You can write and read them as a text file. Wikipedia contains a description with examples. In this assignment the header must look like


where L is the number of rows and columns and M is the maximal color value used in the body of the file.

Implementation details: Simplifications

Implementation strategy

The following is a possible implementation strategy. While it is not mandatory to follow it, I highly recommend any student without extensive programming experience to adhere to it.

Splitting the assignment into smaller tasks

This assignment comprises four largely independent tasks:


You can adjust the code from Assignment 0 to this end. Use global variables to store the number of threads and lines to make available to all threads.


A framework for synchronizing compute and write threads was presented in the lecture.

When finishing these steps you have functional synchronization facilities as required for the assignment. It is helpful to

Data transfer

Recall that the result of a single computation in the lecture’s code was represented by an array

int * result;

Since items correspond to rows in this implementation, each result has to include the attractor index and the convergence index for each pixel in a row.

Implement, allocate, and free two global arrays

TYPE_ATTR ** attractors;
TYPE_CONV ** convergences;

and corresponding local arrays in the computation and writing threads:

TYPE_ATTR * attractor;
TYPE_CONV * convergence;


It is advantagous to implement the writing step prior to the computation step, since you will have accessible feedback once the writing is in place. To guarantee meaningful input values in the writing thread add the following initialization to your computing threads:

for ( size_t cx = 0; cx < nmb_lines; ++cx ) {
  attractor[cx] = 0;
  convergence[cx] = 0;

Make sure that the attractor and convergence value 0 indeed is assigned meaning in your encoding. Most likely this is the case, but it is important to check these assumptions to avoid unnecessary searching for errors in your code.

One might be tempted to implement the writing as

fprintf(attr_file, "%d %d %d ", COLOR_VALUE0, COLOR_VALUE1, COLOR_VALUE2);
fprintf(conv_file, "%d %d %d ", GRAY_VALUE, GRAY_VALUE, GRAY_VALUE);

but recall that fprintf is slow; fwrite is a better solution.


Before providing code for the iteration, the checking should be in place to guarantee that each step of the computation is carried out under the valid assumption that the norm of x be not too small.

There are three kinds of checks:

When implementing the arithmetic part of this step, it is helpful to first consider two questions:

The checking for the d+2 states (d the degree of the given polynomial) inside of a repeated computation like the iteration in this assignment requires an adapted programming pattern. Consider one similar to the following:

for ( conv = 0, attr = DEFAULT_VALUE; ; ++conv ) {
    attr = VALUE;
    attr = VALUE;
  for ( EXPRESSION )
    if ( CHECK CONDITION ) {
  if ( attr != DEFAULT_VALUE )

  // computation

To understand this code snippet, read up on the syntax of


It remains to implement the computation in order to complete the program. Since you can use functionality from complex.h and you can hardcode your formula, this step is largely about finding an expression for the iteration step that is as efficient as possible. Once more, recall that using cpow is not an option.

When hardcoding the expression that you derive from this, the following syntax is convenient:

switch ( degree ) {
  case 1:
  case 2:
  case 3:
  // insert further cases

    fprintf(stderr, "unexpected degree\n");

To understand this code, read up on

Test script

You can check your submission with the script

/home/hpc2018/a2_grading/ PATH_TO_SUMISSION.tar.gz

For instance, if you have provided your submission in the file /home/hpcuser023/newton.tar.gz, then you run

/home/hpc2018/a2_grading/ /home/hpcuser023/newton.tar.gz

Recall that your submission has to pass this script for you to pass this assignment. So test before submitting.

Performance goals

The program will be tested with the following arguments. It must be at least as fast as given in the corresponding tables.

Single threaded, 1000 lines, varying polynomial x^d - 1

Degree of polynomial 1 2 5 7
Maximal runtime in seconds 1.01 1.48 1.52 1.64

Multithreaded, 1000 lines, x^5 - 1

Number of threads 1 2 3 4
Maximal runtime in seconds 1.52 0.70 0.55 0.42

Ten threads, varying number of lines, x^7 - 1

Number of lines 1000 50000
Maximal runtime in seconds 0.26 594

Going into depth: Further questions to consider

This assignment presents plenty of opportunities for improvment, both from the algorithmic point of view and the technical one. Here are some inspirations for further exploration.

Assignment 3: OpenMP

Back to the overview of assignments

In this assignment, we count distances between points in 3-dimensional space. One possibility to think of this is as cells, whose coordinates were obtained by diagnostics and which require further processing.

Distances are symmetric, i.e. distance between two cells c_1 and c_2 and the distance in reverse direction c_2 to c_1 should be counted once.


Implement a program in C, called cell_distance, that uses OpenMP for parallelization and that:

Your program should accept command line arguments

./cell_distance -t5

The argument to -t give the number of OpenMP threads that you may use. You may only use OpenMP parallelism.

You can find an example program at

/home/hpc2018/a3_example/cell_distance -t10


Coordinates are given in a file, in which each row corresponds to one cell. The three coordinates are separated by exactly one blank space. Each coodinate is given exactly as follows: A sign ‘+’ or ‘-‘, two digits, a decimal point ‘.’, and three digits. For instance, the following is a valid input of 10 cells.

+01.330 -09.035 +03.489
-03.718 +02.517 -05.995
+09.568 -03.464 +02.645
-09.620 +09.279 +08.828
+07.630 -02.290 +00.679
+04.113 -03.399 +05.299
-00.994 +07.313 -06.523
+03.376 -03.614 -06.657
+01.304 +09.381 -01.559
-04.238 -07.514 +08.942


A valid output to the above input is

03.00 1
05.54 1
05.85 1
05.91 1
06.07 1
06.54 1
07.94 1
08.58 1
09.40 1
09.59 1
09.65 1
09.98 1
10.00 1
11.18 1
11.68 1
11.77 1
11.98 1
13.46 1
14.02 1
14.11 1
14.77 1
14.78 1
14.96 1
15.07 1
15.38 1
15.71 1
15.78 1
15.84 1
16.75 1
16.94 1
17.33 1
17.63 1
17.66 1
17.72 1
17.79 1
18.00 1
19.02 1
19.10 1
19.31 1
20.65 1
21.67 1
22.00 1
22.31 1
23.85 1
23.98 1

You can find further test data in the folder


Reference implementation

A reference implementation in Julia is available at


Implementation details: Simplifications

Implementation details: Memory consition

You do not have to proof that this is the case, but your report has to contain an arugment why your implementation achieves this.

Implementation details: Precision

In case of doubt use the test script from below.

Implementation strategy

The following is an outline of aspects that play into designing an efficient implementation.

Splitting the assignment into smaller tasks

This assignment comprises four tasks. As opposed to Assignment 2, they are not independent, but design choices influence each other.

Computing the distance of two points

Here is a list of observations and questions that can help you designing a good solution to this subtask.

Parsing the file

When parsing the file or parts of it, it is a huge advantage to benefit from the fixed input format. The following observations and questions can guide your implementation.

Memory management

There are three aspects of memory management that you have to balance. Recall that block iteration was a technique you saw in the lectures.


Parallelization can enter at several points of the program. Parsing and counting distances are the two most beneficial ones.

Test script

You can check your submission with the script

/home/hpc2018/a3_grading/ PATH_TO_SUMISSION.tar.gz

Notice that the test script

  1. checks your output for correctness,
  2. redirects stdout to a file, so that additional runtime costs for writing to disc can apply.

Performance goals

Number of points 1e4 1e5 1e5 1e5
Number of threads 1 5 10 20
Maximal runtime in seconds 0.41 8.2 4.1 2.6

Assignment 4: OpenCL

Back to the overview of assignments

A simple model for heat diffiusion in 2-dimensional space, splits it up into n_x times n_x small boxes. In each time step, the temperature h(i,j) of a box with coodinates i and j is updated as

h(i,j) + c * ( (h(i-1,j) + h(i+1,j) + h(i,j-1) + h(i,j+1))/4 - h(i,j) )

where c is a diffision constant. We consider the case of bounded region of box shape whose boundary has constant temperature 0.


Implement in C using the GPU via OpenCL a program heat_diffusion that

  1. Initializes an array of given size to 0 except for the middle point whose initial value is determined by a parameter -i.
    • If the number of row or columns is even, divide up the initial value between the 2 or 4 central points.
  2. Executes a given number of steps of heat diffusion with given diffusion constant, and
  3. Outputs the average a of temperatures.
  4. Outputs the average absolute difference of each temperature to the average of all temperatures.

Your program should accept command line arguments

./heat_diffusion 1000 100 -i1e10 -d0.02 -n20

to compute with boxes of size width 1000, height 100, initial central value 1e10, diffusion constant 0.02, and 20 iterations. You may only use OpenCL parallelism. You find an example program at

/home/hpc2018/a4_example/heat_diffusion 1000 100 -i1e10 -d0.02 -n20


As an example, we consider the case of 3 times 3 boxes with diffusion constant 1/30. Their initial heat values written in a table are

0 0 0
0 1,000,000 0
0 0 0

Then the next two iterations are

0 8333 0
8333 9667e2 8333
0 8333 0


138.9 1611e1 138.9
1611e1 9347e2 1611e1
138.9 1611e1 138.9

In particular, the average temperature is 111080 (as opposed to the original average 111111).

The absolute difference of each temperature and the average is

1109e2 9497e1 1109e2
9497e1 8236e2 9497e1
1109e2 9497e1 1109e2

with average is 183031.55. After five further iterations this will decrease to 151816.97.

Performance goals

Width * Height 100 * 100 10000 * 10000 100000 * 100
Initial value 1e20 1e10 1e10
Diffusion constant 0.01 0.02 0.6
Number of iterations 100000 1000 200
Time in seconds 1.7 98 1.4

Assignment 5: MPI

Back to the overview of assignments

Dijkstra’s algorithms finds shortest paths in graphs with directed, weighted edges. In the following example of a symmetric (i.e. undirected) graph is, a shortest path from node 2 to node 4 would go via node 1 and node 7.

An example of a directed graph would be the following triangle, in which shortest pathes between nodes are unique.

One can implement Dijkstra’s algorithm via MPI: Every worker gets assigned a fixed set of nodes. In every iteration, each worker updates its unvisited neighbore to the current node. The minimal tentative distance among all nonvisited nodes can also be determined by a distributed computation.

Using MPI on Gantenbein

MPI on Gantenbein is organized as a module. Before using it run

module load mpi


Implement the described algorithm as a program called dijkstra in C using MPI. The program should output the length of a shortest path in the graph from a given source to a given destination.

mpirun -n 1 dijkstra 2 4 example_graph

You can call an example program via

mpirun -n 1 /home/hpc2018/a5_example/dijkstra 1 17 /home/hpc2018/a5_grading/test_data/graph_de1_ne3_we2

Graph input file

Your program should read the graph from a given file with lines of the form

i1 i2 w

where i1 and i2 are indices of vertices connected by an edge of weight w. The first two columns are the lable of a vertices, a positive integer, and the third one is a weight, an integer greater than 0. The above graph would, for example, yield

6 4 51
5 2 81
0 5 14
5 0 14
6 1 55
2 7 3
7 5 98
3 4 82
3 0 59
6 3 87
1 4 61
5 7 98
7 2 3
2 5 81
1 6 55
4 3 82
0 2 34
0 3 59
3 6 87
4 1 61
1 7 31
4 6 51
2 0 34
7 1 31

Note that the this graph has undirected edges, so that the table of weights is symmetric.

Examples of Graphs

You can find a list of test graphs at


There is also a python script, that you can use to generate test cases for you convenience.

Test script

You can check your submission with the script

/home/hpc2018/a5_grading/ PATH_TO_SUMISSION.tar.gz

Performance goals

For the files that can be found at

Degree 10 100 100 100 1000
Number of vertices 1000 10000 10000 100000 100000
Maximal weight 100 100 100 100 1000
File name graph_de1_ne3_we2 graph_de2_ne4_we2 graph_de2_ne4_we2 graph_de2_ne5_we2 graph_de3_ne5_we3
Number of nodes 1 1 4 10 20
Start point 9 17 17 107 4
End point 82 40 18 0 5
Time in seconds 1.2 6.8 3.1 98 256