
The Discussion

The code begins innocently enough, like all the other MPI codes we have seen so far: MPI itself is initialised, then every process finds out the size of the process pool and its own rank within that pool.

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &pool_size);
   MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

And then, as usual, one of the processes assumes the role of the master:

   if (my_rank == MASTER_RANK) i_am_the_master = TRUE;

But the next operation is entirely new:

   MPI_Cart_create ( MPI_COMM_WORLD, PROCESS_DIMENSIONS, divisions, 
                     periods, reorder, &cartesian_communicator );
What happens here is as follows. The processes create a new communicator, which, unlike MPI_COMM_WORLD, has a special topology imposed on it. The new communicator, the place for which is passed to MPI_Cart_create as the last argument, is formed out of the old one, which is passed to MPI_Cart_create as the first argument. The four parameters passed in the middle, PROCESS_DIMENSIONS, divisions, periods, and reorder, specify the topology of the new communicator. When we say topology, we mean that some connectivity is imposed on the processes within this new communicator: in this case every process acquires left, right, bottom, and top neighbours.

The value of PROCESS_DIMENSIONS is 2, which means that the processes are to be organised into a 2-dimensional grid. divisions is an array of rank 1, whose entries specify the lengths of the process grid in those two dimensions, in this case:

int divisions[PROCESS_DIMENSIONS] = {4, 3};
that is, the processes are going to be organised into a rectangular grid with 4 rows and 3 columns.

The parameter periods specifies whether the topology is going to be a wrap-around topology or an open-ended one, i.e., whether the processes on the borders of the region should have, as their over-the-border neighbours, the processes on the opposite side, or nobody at all. In this case we simply say that

int periods[PROCESS_DIMENSIONS] = {0, 0};
which means that we don't want any wrapping.

The last parameter in this group, reorder, specifies whether the processes may be reordered for efficiency. This implies that processes may be moved around between processors, or just renumbered, so as to make better use of whatever hardware architecture a given machine may have. Examples of such machines are the Cray T3E, the Fujitsu AP-3000, and the NEC Cenju-3.
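
Putting these pieces together, the whole topology set-up amounts to something like the sketch below. The value of reorder here is an assumption made purely for illustration; everything else follows the definitions above:

   #define PROCESS_DIMENSIONS 2

   int divisions[PROCESS_DIMENSIONS] = {4, 3};  /* 4 rows by 3 columns of processes */
   int periods[PROCESS_DIMENSIONS]   = {0, 0};  /* open ended, no wrap-around       */
   int reorder = 1;                             /* let MPI renumber the processes   */
   MPI_Comm cartesian_communicator;

   MPI_Cart_create ( MPI_COMM_WORLD, PROCESS_DIMENSIONS, divisions,
                     periods, reorder, &cartesian_communicator );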

If you have asked for a $4\times 3$ Cartesian communicator and started with, say, 15 processes, then 3 of those 15 processes would have to be rejected from the new communicator, because there is only going to be enough space in it for 12 processes. The rejected processes will find that after MPI_Cart_create returns, the returned value of cartesian_communicator is MPI_COMM_NULL. What this means is that those processes didn't get the job.

They may hang about, if they choose, or they may go right for MPI_Finalize - this is up to the programmer. You may create more than one communicator with various properties, so that the processes that didn't make it into the cartesian_communicator will get jobs elsewhere.

For this reason the remainder of the program is one large if statement:

   if (cartesian_communicator != MPI_COMM_NULL) {
      blah... blah... blah...
   }

   MPI_Finalize ();
}

Now, assuming that you did get a job, as a process, with the cartesian_communicator, the first thing that you do is to find your new rank number within that new communicator:

      MPI_Comm_rank ( cartesian_communicator, &my_cartesian_rank );
And the reason for this is that your rank number in the cartesian_communicator is an entirely different thing from your old number in the MPI_COMM_WORLD. They're like a tax file number in Australia and a tax file number in the US.

But now, since the cartesian_communicator is a communicator with a topology, you can make inquiries about your neighbourhood:

      MPI_Cart_coords ( cartesian_communicator, my_cartesian_rank,
                        PROCESS_DIMENSIONS, my_coordinates );
      MPI_Cart_shift ( cartesian_communicator, SIDEWAYS, RIGHT, 
                       &left_neighbour, &right_neighbour );
      MPI_Cart_shift ( cartesian_communicator, UPDOWN, UP,
                       &bottom_neighbour, &top_neighbour );

The first call to MPI_Cart_coords returns your Cartesian coordinates in an array my_coordinates of length PROCESS_DIMENSIONS. Naturally you have to tell MPI_Cart_coords who you are, in order to obtain that information. You have to pass on your new Cartesian rank number in the second slot of the function.

The following two calls to MPI_Cart_shift tell you about the rank numbers of your left and right neighbours and of your bottom and top neighbours. So the function MPI_Cart_shift doesn't really shift anything. It's more like looking up who your neighbours are. But the designers of MPI thought of it in terms of shifting rank numbers of your neighbours from the left and from the right and from above and from below into those containers, which you call left_neighbour, right_neighbour, bottom_neighbour, and top_neighbour.
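
If you sit on an edge of the grid and there is no wrap-around, then there is simply no neighbour in that direction, and MPI_Cart_shift puts MPI_PROC_NULL into the corresponding container; that is the negative number you will see in the output further down. Here is a minimal sketch, not part of the program, of how you could test for it:

      /* Illustration only: a border process has no neighbour in that
         direction, so MPI_Cart_shift returns MPI_PROC_NULL there.  A send
         to or receive from MPI_PROC_NULL is simply ignored by MPI. */
      if (left_neighbour == MPI_PROC_NULL)
         printf ("process %2d has no left neighbour\n", my_cartesian_rank);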

This stuff is worth displaying, so what we do now is to pack all that information into a message and send it to the master process.

This communication takes place within MPI_COMM_WORLD:

      if (! i_am_the_master ) {
         sprintf(char_buffer, "process %2d, cartesian %2d, \
coords (%2d,%2d), left %2d, right %2d, top %2d, bottom %2d",
                 my_rank, my_cartesian_rank, my_coordinates[0],
                 my_coordinates[1], left_neighbour, right_neighbour,
                 top_neighbour, bottom_neighbour);
         MPI_Send(char_buffer, strlen(char_buffer) + 1, MPI_CHAR,
                  MASTER_RANK, 3003, MPI_COMM_WORLD);
      } 
      else {

         int number_of_c_procs, count;

         number_of_c_procs = divisions[0] * divisions[1]; 
         for (count = 0; count < number_of_c_procs - 1; count++) {
            MPI_Recv(char_buffer, BUFSIZ, MPI_CHAR, MPI_ANY_SOURCE, 3003,
                     MPI_COMM_WORLD, &status);
            printf ("%s\n", char_buffer);
         }
         printf( "process %2d, cartesian %2d, \
coords (%2d,%2d), left %2d, right %2d, top %2d, bottom %2d\n",
                 my_rank, my_cartesian_rank, my_coordinates[0],
                 my_coordinates[1], left_neighbour, right_neighbour,
                 top_neighbour, bottom_neighbour);
      }
By now you should be able to figure out this part of the code for yourself, because we've done both MPI_Send and MPI_Recv.

Observe a subtle bug here: we are assuming that the master process is still going to hang around! In other words, that the master process was included in the cartesian_communicator.

What will happen if it is not included?

In our case we are going to run this job requesting exactly the right number of processes, so the master process is guaranteed to be included in the cartesian_communicator. Having said that, you may consider modifying the logic of the program so as to get rid of the bug.
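
One way to get rid of it, sketched below on the assumption that the report itself stays the same, is to perform this exchange inside the cartesian_communicator and let the process with cartesian rank 0 do the collecting, because that process is guaranteed to exist:

      /* A sketch only: the same report as above, but collected within the
         cartesian_communicator by the process with cartesian rank 0, which
         is always a member of it, whatever happened to the world master. */
      if (my_cartesian_rank != 0) {
         sprintf(char_buffer, "process %2d, cartesian %2d, \
coords (%2d,%2d), left %2d, right %2d, top %2d, bottom %2d",
                 my_rank, my_cartesian_rank, my_coordinates[0],
                 my_coordinates[1], left_neighbour, right_neighbour,
                 top_neighbour, bottom_neighbour);
         MPI_Send(char_buffer, strlen(char_buffer) + 1, MPI_CHAR,
                  0, 3003, cartesian_communicator);
      }
      else {
         int count;
         for (count = 0; count < divisions[0] * divisions[1] - 1; count++) {
            MPI_Recv(char_buffer, BUFSIZ, MPI_CHAR, MPI_ANY_SOURCE, 3003,
                     cartesian_communicator, &status);
            printf ("%s\n", char_buffer);
         }
         printf ("process %2d, cartesian %2d, \
coords (%2d,%2d), left %2d, right %2d, top %2d, bottom %2d\n",
                 my_rank, my_cartesian_rank, my_coordinates[0],
                 my_coordinates[1], left_neighbour, right_neighbour,
                 top_neighbour, bottom_neighbour);
      }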

Here is what the output looks like, once the master process gets down to it:

process  8, cartesian  8, coords ( 2, 2), left  7, right -3, top 11, bottom  5
process  4, cartesian  4, coords ( 1, 1), left  3, right  5, top  7, bottom  1
process  1, cartesian  1, coords ( 0, 1), left  0, right  2, top  4, bottom -3
process  5, cartesian  5, coords ( 1, 2), left  4, right -3, top  8, bottom  2
process 11, cartesian 11, coords ( 3, 2), left 10, right -3, top -3, bottom  8
process  9, cartesian  9, coords ( 3, 0), left -3, right 10, top -3, bottom  6
process  3, cartesian  3, coords ( 1, 0), left -3, right  4, top  6, bottom  0
process  7, cartesian  7, coords ( 2, 1), left  6, right  8, top 10, bottom  4
process  2, cartesian  2, coords ( 0, 2), left  1, right -3, top  5, bottom -3
process  6, cartesian  6, coords ( 2, 0), left -3, right  7, top  9, bottom  3
process 10, cartesian 10, coords ( 3, 1), left  9, right 11, top -3, bottom  7
process  0, cartesian  0, coords ( 0, 0), left -3, right  1, top  3, bottom -3
You should compare this with the matrix lay-out that I have printed above and you'll see that it all makes some sense.

Observe that cartesian rank numbers here are the same as the world rank numbers. But you must not count on it. It just happens to be so here, on the SP, for this particular program, and under this particular version of PSSP, MPI, etc. This is not an MPI requirement. Just the opposite, in fact.

What happens now is this simple piece of code:

      for ( i = 0; i < ROWS; i++ ) {
         for ( j = 0; j < COLUMNS; j++ ) {
            matrix [i][j] = my_cartesian_rank;
         }
      }
Every process simply initializes its own little matrix with its own cartesian rank number. The whole matrix is initialized, including the parts that are going to be dedicated to mirroring, along the borders of the matrix.

Once the matrices have been initialized, the processes send them to the master process, who diligently collects them and displays them on standard output:

      if (my_cartesian_rank != MASTER_RANK ) 
         MPI_Send ( matrix, COLUMNS * ROWS, MPI_INT, MASTER_RANK, 3003,
                    cartesian_communicator );
      else
         collect_matrices ( cartesian_communicator, my_cartesian_rank,
                            matrix, 3003 );
But observe that this time the communication takes place entirely within the cartesian_communicator, and the process that collects all matrices does not have to be the same process as the master process in the MPI_COMM_WORLD communicator.
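
The routine collect_matrices is defined elsewhere in the code. Purely as an illustration, here is a minimal sketch of what such a routine might boil down to; the argument order matches the calls above, but the exact layout of the printout, and the use of memcpy (from string.h, already included for strlen), are assumptions:

   /* A sketch only, not the routine from the actual code: receive every
      other process's matrix, in cartesian rank order, and print each one,
      substituting our own copy when the turn of our own rank comes. */
   void collect_matrices ( MPI_Comm comm, int my_rank,
                           int matrix[ROWS][COLUMNS], int tag )
   {
      int buffer[ROWS][COLUMNS];
      int pool_size, source, i, j;
      MPI_Status status;

      MPI_Comm_size ( comm, &pool_size );
      for ( source = 0; source < pool_size; source++ ) {
         if ( source == my_rank )
            memcpy ( buffer, matrix, sizeof(buffer) );    /* our own matrix */
         else
            MPI_Recv ( buffer, COLUMNS * ROWS, MPI_INT, source, tag,
                       comm, &status );
         printf ( "matrix of cartesian process %2d:\n", source );
         for ( i = 0; i < ROWS; i++ ) {
            for ( j = 0; j < COLUMNS; j++ ) printf ( "%3d ", buffer[i][j] );
            printf ( "\n" );
         }
      }
   }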

Now the processes in the cartesian_communicator have to exchange information with their neighbours:

      MPI_Sendrecv ( &matrix[ROWS - 2][0], COLUMNS, MPI_INT, 
                     top_neighbour, 4004,
                     &matrix[0][0], COLUMNS, MPI_INT, bottom_neighbour, 
                     4004, 
                     cartesian_communicator, &status );

      MPI_Sendrecv ( &matrix[1][0], COLUMNS, MPI_INT, bottom_neighbour,
                     5005,
                     &matrix[ROWS - 1][0], COLUMNS, MPI_INT, 
                     top_neighbour, 5005,
                     cartesian_communicator, &status );
While sending this stuff up and down, we're making use of the fact that matrices in C are stored in row-major fashion, i.e., the other way round from Fortran. The first statement sends the whole second row from the top (I have numbered these matrices upside down, so as to stick to the usual $x\times y$ convention) to the top neighbour. The send buffer begins at &matrix[ROWS - 2][0] and is COLUMNS items long. The items are of type MPI_INT and the corresponding message is going to have tag 4004.

At the same time, every process receives into its bottom row data sent by its bottom neighbour. The beginning of the receive buffer is &matrix[0][0], the buffer contains items of type MPI_INT, and the number of those items is COLUMNS.

Of course this communication must take place within the cartesian_communicator, because it is only within this communicator that the notion of a bottom and top neighbour makes sense.

The second MPI_Sendrecv operation does the same, but in the opposite direction, i.e., the data is now sent down, whereas previously it went up.
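
It is worth noting why MPI_Sendrecv is used here rather than a plain MPI_Send followed by an MPI_Recv: the combined call lets MPI schedule the two transfers so that they cannot block each other. Written naively with blocking calls, as in the sketch below, the exchange is unsafe: if MPI_Send waits for the matching receive to be posted, every process waits on its neighbour and the program may hang.

      /* What the code avoids: an unsafe hand-written exchange in which every
         process sends first and only then receives.  If MPI_Send does not
         return until the matching receive is posted, all processes block. */
      MPI_Send ( &matrix[ROWS - 2][0], COLUMNS, MPI_INT, top_neighbour, 4004,
                 cartesian_communicator );
      MPI_Recv ( &matrix[0][0], COLUMNS, MPI_INT, bottom_neighbour, 4004,
                 cartesian_communicator, &status );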

Having done that we display the state of the whole system again:

      if (my_cartesian_rank != MASTER_RANK) 
         MPI_Send ( matrix, COLUMNS * ROWS, MPI_INT, MASTER_RANK, 6006,
                    cartesian_communicator );
      else 
         collect_matrices ( cartesian_communicator, my_cartesian_rank,
                            matrix, 6006 );

The last part of the code is a little tricky. Here we have to exchange columns between neighbours, but columns are not stored contiguously in C.

MPI provides a very powerful apparatus for situations like that.

First we define a strided data type:

      MPI_Type_vector (ROWS, 1, COLUMNS, MPI_INT, &column_type);
      MPI_Type_commit (&column_type);
This data type is constructed as follows: it comprises ROWS items of type MPI_INT. The items are collected by picking up 1 item every COLUMNS steps from contiguous storage. The description of this peculiar data type is placed in column_type. Once the type has been defined, it must be committed. Here we give a parallel machine an opportunity to adjust its hardware or software logic in order to prepare for manipulating this new data type.
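
In other words, one item of column_type covers matrix[0][j], matrix[1][j], ..., matrix[ROWS - 1][j] for whichever column j its start address points at. A small sanity check one could add, purely as an illustration (it is not in the program), shows the difference between the amount of data the type carries and the stretch of memory it spans:

      /* Illustration only: column_type carries ROWS integers, but, because
         of the stride, it spans ((ROWS - 1) * COLUMNS + 1) integers of the
         underlying matrix. */
      int column_size;
      MPI_Aint column_extent;

      MPI_Type_size ( column_type, &column_size );
      MPI_Type_extent ( column_type, &column_extent );
      printf ( "column_type: %d bytes of data spread over %ld bytes\n",
               column_size, (long) column_extent );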

Now we can exchange columns between neighbours in much the same way as we did rows before:

      MPI_Sendrecv ( &matrix[0][1], 1, column_type, left_neighbour, 7007,
                     &matrix[0][COLUMNS - 1], 1, column_type, 
                     right_neighbour, 7007,
                     cartesian_communicator, &status );
 
      MPI_Sendrecv ( &matrix[0][COLUMNS - 2], 1, column_type, 
                     right_neighbour, 8008,
                     &matrix[0][0], 1, column_type, left_neighbour, 8008,
                     cartesian_communicator, &status );
The first operation sends one item of column_type, stored at &matrix[0][1], to the left_neighbour. The corresponding message has tag 7007. At the same time another single item of column_type is received from the right_neighbour and placed at &matrix[0][COLUMNS - 1]. Again, observe that we are not sending the border columns: we are sending internal columns, and we're receiving into the border columns.

The second MPI_Sendrecv call does the same in the opposite direction, i.e., from left to right.

After the data has been exchanged, the state of the system is displayed again by calling:

      if (my_cartesian_rank != MASTER_RANK)
         MPI_Send ( matrix, COLUMNS * ROWS, MPI_INT, MASTER_RANK, 9009,
                    cartesian_communicator );
      else
         collect_matrices ( cartesian_communicator, my_cartesian_rank,
                            matrix, 9009 );

So, this is what the program does.

It is messy and clumsy compared to an HPF program, where you don't have to bother about any of that at all. The IBM HPF compiler, as a matter of fact, compiles your HPF program into an MPI program that does very much what this simple example code illustrates. Using HPF in place of MPI, wherever applicable, will save you a lot of hard work.

