Diffusion

In this section I will lay the foundation for an MPI version of a diffusion code based on Jacobi iterations. The idea is that a flat rectangular region, whose dimensions in this toy example are going to be $6\times8$, will be divided amongst 12 processes, each handling a small $2\times2$ square. But because this is a differential problem, apart from its own $2\times2$ data square, each process will also have to maintain additional data that mirrors whatever the neighbouring processes hold along the edges of their $2\times2$ squares.

Diffusion is a very universal phenomenon that is encountered almost everywhere, beginning with the microscopic world of quantum systems and going all the way up to atmospheric circulation, pollution, and hurricanes. In between you encounter diffusion of nutrients through various membranes in cells, diffusion of oil through geological strata, or diffusion of heat through a combustion engine.

The mathematics of diffusion in all these systems is very similar and, at its simplest, is described by the following partial differential equation:

\begin{displaymath}\frac{\partial T(x, y, z)}{\partial t} = D \nabla^2 T(x, y, z)
\end{displaymath}

where T(x, y, z) represents the diffusing quantity (e.g., it may stand for temperature), t stands for time, D is the diffusion coefficient, and $\nabla^2$ is the Laplace operator, which in Cartesian coordinates looks like this:

\begin{displaymath}\nabla^2 = \frac{\partial^2}{\partial x^2} + \frac{\partial^2}{\partial y^2}
+ \frac{\partial^2}{\partial z^2}
\end{displaymath}

Assuming that the problem is static, we have $\frac{\partial T(x, y, z)}{\partial t} = 0$ and then

 \begin{displaymath}
\nabla^2 T(x, y, z) = 0
\end{displaymath} (5.2)

too. This is called  the Laplace equation.

We can simplify the problem further by assuming that it is only 2-dimensional, so that

\begin{displaymath}\nabla^2 T(x, y) = 0
\end{displaymath}

This equation has the following discretized equivalent on a regular rectangular lattice with spacing h for any lattice node (x0, y0):

 \begin{displaymath}
T(x_0, y_0) = \frac{1}{4}
\left(
T(x_0 + h, y_0) + T(x_0 - h, y_0)
+ T(x_0, y_0 + h) + T(x_0, y_0 - h)
\right)
\end{displaymath} (5.3)

In other words, the value of T at (x0, y0) is the arithmetic average of T over the nearest neighbours of (x0, y0) (we don't take the diagonal neighbours into account, because they are not the nearest).
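Where does equation (5.3) come from? Replacing the second derivatives in the Laplace equation with their standard central-difference approximations on a lattice with spacing h gives

\begin{displaymath}
\nabla^2 T(x_0, y_0) \approx
\frac{T(x_0 + h, y_0) - 2 T(x_0, y_0) + T(x_0 - h, y_0)}{h^2}
+ \frac{T(x_0, y_0 + h) - 2 T(x_0, y_0) + T(x_0, y_0 - h)}{h^2}
\end{displaymath}

Setting this expression to zero, multiplying by $h^2$, and solving for T(x0, y0) yields equation (5.3).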

This provides us with a simple method of solving equation (5.2) for a situation with fixed boundary conditions, i.e., we know T everywhere on the boundary of some 2-dimensional region and we want to find T everywhere within the region. The solution is to begin with an initial guess. Then use equation (5.3) to replace each T(x0, y0) with the average over the neighbours of (x0, y0). We may have to repeat this process over and over, until we notice that T no longer changes from iteration to iteration. This is going to be our numerical solution of equation (5.2). This very simple method is due to the German mathematician Carl Gustav Jacob Jacobi, who was born on the 10th of December 1804 in Potsdam near Berlin, and who died on the 18th of February 1851 in Berlin.
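To make the procedure concrete, here is a minimal serial sketch in C of a single Jacobi sweep over a small lattice with fixed boundary values. The array names and dimensions are illustrative only and are not taken from the program discussed in the next section; in practice the sweep is repeated until T stops changing to within some tolerance.

  /* One Jacobi sweep on an NX x NY lattice: every interior node of
     t_new becomes the average of the four nearest neighbours in t_old.
     The boundary rows and columns hold the fixed boundary conditions
     and are never overwritten. */

  #define NX 8
  #define NY 6

  void jacobi_sweep(double t_old[NX][NY], double t_new[NX][NY])
  {
    for (int i = 1; i < NX - 1; i++)
      for (int j = 1; j < NY - 1; j++)
        t_new[i][j] = 0.25 * (t_old[i+1][j] + t_old[i-1][j]
                            + t_old[i][j+1] + t_old[i][j-1]);
  }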

In order to carry out Jacobi iterations on a parallel system, you need to divide the two-dimensional region, in the simplest case a rectangle, into, say, squares, allocating each square to a separate process. For nodes on the boundaries of the squares, we will need to know the values that the function T assumes on nodes that belong to other processes, so that we can perform the averaging.

This is where MPI communication comes in. We will have to use MPI in order to transfer these values, but we only need to transfer values associated with the boundaries of the squares. If each square comprises $N\times N$ nodes, then a process responsible for a given square will have to send $4\times N$ data items to its neighbours in each iteration. The amount of data that needs to be communicated therefore scales like N, whereas the amount of data that stays within the square and can be computed on without communication scales like $N^2$. Assuming that computation is a thousand times faster than communication, we get a ``break-even'' for $N^2 = 1000 \times N$, which yields N=1000. For N much larger than 1000 the cost of communication will be negligible compared to the cost of computation. This means that the squares need to be very large: the faster the CPUs are compared to the network, the more nodes you have to pack into each square to make the parallelization worthwhile.

The communication in this case can be facilitated further if the processes responsible for adjacent squares can be made to run on ``adjacent'' processors. There are machines, such as the Cray X1, whose internal network topology lets programmers do just this. Such machines are exceptionally well suited to solving field problems, of which diffusion is the simplest.

In this section you will learn how MPI provides various constructs that are very helpful in capturing this type of problem. But we are not going to solve the problem itself. Instead we will focus on the MPI mechanisms themselves and we will analyze a simple MPI demonstration program that explores these mechanisms.

In our simple example each process looks after a small $2\times2$ integer matrix, which represents the ``square'', and the integers represent the function T. We are going to surround each $2\times2$ square with additional rows and columns around its boundary, into which we are going to transfer data from the adjacent squares. There will be a column on the left and on the right of the $2\times2$ region to represent values obtained from the adjacent squares on the left and on the right hand side, and rows above and below the $2\times2$ region to represent values obtained from the adjacent squares above and below ``our'' square. So, in effect, each process will really have to work with a $4\times4$ matrix, but only the $2\times2$ interior of this matrix is its own, and the periphery of the matrix will be populated with data obtained from other processes.
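In C, for instance, such a patch could be held in a plain $4\times4$ integer array; the name and index convention below are just an illustration of the layout, not the program itself, which is listed in the next section.

  /* The local patch: a 2x2 interior owned by this process, surrounded
     by a one-cell halo that mirrors data owned by the neighbours.
     Interior: local[1][1], local[1][2], local[2][1], local[2][2].
     Halo: row 0, row 3, column 0, and column 3, filled via MPI.      */
  int local[4][4];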

Within the example code we are going to:

1.
arrange processes into a virtual 2-dimensional Cartesian process topology
2.
orient processes within this new topology (so that they know who their neighbours are)
3.
exchange data with neighbouring processes within the Cartesian topology so as to populate the peripheral rows and columns in the $4\times4$ matrices.

At every stage the master process is going to collect data from all other processes and print it on standard output, so as to show the state of the system at a glance.
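The first two steps of this plan can be expressed with just a few MPI calls. The sketch below is only an illustration under assumed names and an assumed dims ordering (4 process rows by 3 process columns); the actual program, listed in the next section, may arrange things somewhat differently.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
    MPI_Comm cartcomm;
    int dims[2]    = {4, 3};   /* 12 processes: 4 rows of 3  */
    int periods[2] = {0, 0};   /* the region is not periodic */
    int rank, up, down, left, right;

    MPI_Init(&argc, &argv);

    /* Step 1: arrange the processes into a 2-dimensional Cartesian grid. */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, 1, &cartcomm);
    MPI_Comm_rank(cartcomm, &rank);

    /* Step 2: find the ranks of the four nearest neighbours.  On the
       edge of the grid MPI_PROC_NULL is returned, and any send to or
       receive from it is silently ignored.                           */
    MPI_Cart_shift(cartcomm, 0, 1, &down, &up);      /* first dimension  */
    MPI_Cart_shift(cartcomm, 1, 1, &left, &right);   /* second dimension */

    printf("rank %d: up=%d down=%d left=%d right=%d\n",
           rank, up, down, left, right);

    MPI_Finalize();
    return 0;
  }

Which neighbour ends up being called ``up'' or ``down'' depends, of course, on how the program maps the Cartesian coordinates onto the printed picture.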

In order to help you understand the whole procedure, I'm going to show you some of the output of the program first, then I'll list the program for you, and then discuss it in more detail.

When the program starts, right after the processes have formed the new Cartesian topology and oriented themselves within it, but before they have exchanged any data, their state looks as follows:

  9  9  9  9  10 10 10 10  11 11 11 11 
  9  9  9  9  10 10 10 10  11 11 11 11 
  9  9  9  9  10 10 10 10  11 11 11 11 
  9  9  9  9  10 10 10 10  11 11 11 11 

  6  6  6  6   7  7  7  7   8  8  8  8 
  6  6  6  6   7  7  7  7   8  8  8  8 
  6  6  6  6   7  7  7  7   8  8  8  8 
  6  6  6  6   7  7  7  7   8  8  8  8 

  3  3  3  3   4  4  4  4   5  5  5  5 
  3  3  3  3   4  4  4  4   5  5  5  5 
  3  3  3  3   4  4  4  4   5  5  5  5 
  3  3  3  3   4  4  4  4   5  5  5  5 

  0  0  0  0   1  1  1  1   2  2  2  2 
  0  0  0  0   1  1  1  1   2  2  2  2 
  0  0  0  0   1  1  1  1   2  2  2  2 
  0  0  0  0   1  1  1  1   2  2  2  2
Every process initializes its own $4\times4$ matrix to its own rank number. The matrices are displayed by the master process in a way that illustrates the topology of the whole system, i.e., we have a rectangular topology $3\times4$, process rank 0 is in the lower left corner, and its neighbours are process rank 1 on the right and process rank 3 above. Process rank 4 sits roughly in the middle and its neighbours are process rank 1 below, process rank 7 above, process rank 3 on the left, and process rank 5 on the right.

Now the processes exchange the contents of their inner rows with their neighbours, i.e., process rank 4 sends the content of its row 3 (counted from the bottom) to process rank 7, which puts it in its own row 1 (counted from the bottom), and, at the same time, receives the content of row 2 from process rank 7 and places it in its own row 4.
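One convenient way to express such an exchange in C is MPI_Sendrecv, which performs the send and the matching receive in a single call and thus avoids deadlock. The fragment below continues the illustrative sketch started above (the array local, the communicator cartcomm, and the neighbour ranks up and down); it assumes that local[0] is the bottom row, which is purely a convention of this illustration and not necessarily that of the program in the next section.

  MPI_Status status;

  /* Send the top interior row to the neighbour above and receive the
     bottom halo row from the neighbour below.                        */
  MPI_Sendrecv(local[2], 4, MPI_INT, up,   1,
               local[0], 4, MPI_INT, down, 1,
               cartcomm, &status);

  /* Send the bottom interior row to the neighbour below and receive
     the top halo row from the neighbour above.                       */
  MPI_Sendrecv(local[1], 4, MPI_INT, down, 2,
               local[3], 4, MPI_INT, up,   2,
               cartcomm, &status);

On the outer boundary of the process grid up or down is MPI_PROC_NULL, so the corresponding halo row is simply left unchanged, which is exactly what the printout below shows.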

After this exchange operation has taken place, the matrices look as follows:

  9  9  9  9  10 10 10 10  11 11 11 11 
  9  9  9  9  10 10 10 10  11 11 11 11 
  9  9  9  9  10 10 10 10  11 11 11 11 
  6  6  6  6   7  7  7  7   8  8  8  8 

  9  9  9  9  10 10 10 10  11 11 11 11 
  6  6  6  6   7  7  7  7   8  8  8  8 
  6  6  6  6   7  7  7  7   8  8  8  8 
  3  3  3  3   4  4  4  4   5  5  5  5 

  6  6  6  6   7  7  7  7   8  8  8  8 
  3  3  3  3   4  4  4  4   5  5  5  5 
  3  3  3  3   4  4  4  4   5  5  5  5 
  0  0  0  0   1  1  1  1   2  2  2  2 

  3  3  3  3   4  4  4  4   5  5  5  5 
  0  0  0  0   1  1  1  1   2  2  2  2 
  0  0  0  0   1  1  1  1   2  2  2  2 
  0  0  0  0   1  1  1  1   2  2  2  2
Now we have to repeat this operation, but this time exchanging columns between neighbours. And so process rank 4 will send its column 2 (counted from the left) to process rank 3, which is going to place it in its own column 4 (counted from the left), and, at the same time process rank 3 is going to send its own column 3 to process rank 4, which is going to place it in its own column 1.
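A column of the $4\times4$ array is not contiguous in C memory, so one way to send it, continuing the same illustrative sketch, is to describe it with a derived datatype built by MPI_Type_vector: 4 blocks of 1 integer each, spaced 4 integers apart. Again, this is only one possible way of doing it, and the program in the next section may proceed differently; column 0 is assumed to be the leftmost column here.

  MPI_Datatype coltype;
  MPI_Type_vector(4, 1, 4, MPI_INT, &coltype);
  MPI_Type_commit(&coltype);

  /* Send the right interior column to the neighbour on the right and
     receive the left halo column from the neighbour on the left.     */
  MPI_Sendrecv(&local[0][2], 1, coltype, right, 3,
               &local[0][0], 1, coltype, left,  3,
               cartcomm, &status);

  /* Send the left interior column to the neighbour on the left and
     receive the right halo column from the neighbour on the right.   */
  MPI_Sendrecv(&local[0][1], 1, coltype, left,  4,
               &local[0][3], 1, coltype, right, 4,
               cartcomm, &status);

  MPI_Type_free(&coltype);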

And after all this is over, the matrices are going to look as follows:

  9  9  9 10   9 10 10 11  10 11 11 11 
  9  9  9 10   9 10 10 11  10 11 11 11 
  9  9  9 10   9 10 10 11  10 11 11 11 
  6  6  6  7   6  7  7  8   7  8  8  8 

  9  9  9 10   9 10 10 11  10 11 11 11 
  6  6  6  7   6  7  7  8   7  8  8  8 
  6  6  6  7   6  7  7  8   7  8  8  8 
  3  3  3  4   3  4  4  5   4  5  5  5 

  6  6  6  7   6  7  7  8   7  8  8  8 
  3  3  3  4   3  4  4  5   4  5  5  5 
  3  3  3  4   3  4  4  5   4  5  5  5 
  0  0  0  1   0  1  1  2   1  2  2  2 

  3  3  3  4   3  4  4  5   4  5  5  5 
  0  0  0  1   0  1  1  2   1  2  2  2 
  0  0  0  1   0  1  1  2   1  2  2  2 
  0  0  0  1   0  1  1  2   1  2  2  2
Observe that now not only does every process know what data is harboured by its neighbours on the left, on the right, above, and below, but it even knows data that somehow made it in diagonally, i.e., from the upper left, upper right, lower left, and lower right directions. The latter, however, is not going to last, because that data came from the peripheral rows and columns of the adjacent squares, so it is not truly representative of the neighbouring processes' internal states.

Now every process can perform its own Jacobi iteration for a diffusion problem within its own little patch, setting new values on its own data while keeping the mirrored data unchanged.
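For a real diffusion calculation the data would be floating point rather than the integer rank markers of this demonstration, but the local update would look much like the sketch below, which again reuses the illustrative $4\times4$ layout: only the $2\times2$ interior is written, and the halo is read but never modified.

  /* One local Jacobi sweep over the 2x2 interior of the patch.
     t is the local patch including the halo; t_new receives the
     averaged interior values.                                   */
  double t[4][4], t_new[4][4];

  for (int i = 1; i <= 2; i++)
    for (int j = 1; j <= 2; j++)
      t_new[i][j] = 0.25 * (t[i+1][j] + t[i-1][j]
                          + t[i][j+1] + t[i][j-1]);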

Before the next iteration can commence, the data has to be exchanged between the neighbours again, in order to refresh the peripheries of the squares.

So now let us have a look at the code:



 