
The Discussion

This program contains numerous new elements and MPI function calls, so this section may be a little heavy. But it also illustrates where MPI differs from earlier, much simpler, but also much more primitive message passing systems. MPI is very powerful. It is not just send and receive.

The code begins innocently enough like all other MPI codes that we have seen so far: MPI itself is initialised, then every process finds out about the size of the process pool, and its own rank within that pool.

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &pool_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
And then, as usual, one of the processes assumes the role of the master:
  if (my_rank == MASTER_RANK) i_am_the_master = TRUE;
But the next operation is entirely new:
  MPI_Cart_create ( MPI_COMM_WORLD, PROCESS_DIMENSIONS, divisions, 
                    periods, reorder, &cartesian_communicator );
What happens here is as follows. The processes create a new communicator, which, unlike MPI_COMM_WORLD, has a special topology imposed on it. The new communicator, the place for which is passed to MPI_Cart_create as the last argument, is formed out of the old one, which is passed to MPI_Cart_create as the first argument. The four parameters passed in the middle, PROCESS_DIMENSIONS, divisions, periods, and reorder, specify the topology of the new communicator. When we say topology, we mean that there is some sort of connectivity between the processes within this new communicator. In this case they will acquire left, right, bottom, and top neighbours.

The value of PROCESS_DIMENSIONS is 2, which means that the processes are to be organised into a two-dimensional grid. divisions is an array of rank 1, whose entries specify the lengths of the process grid in those two dimensions, in this case:

#define PROCESS_ROWS        4
...
#define PROCESS_COLUMNS     3
...
  int divisions[PROCESS_DIMENSIONS] = {PROCESS_ROWS, PROCESS_COLUMNS};
that is, the processes are going to be organised into a rectangular grid with 4 rows and 3 columns.

The parameter periods specifies whether the topology is going to be a wrap-around topology or an open-ended one, i.e., whether the processes on the borders of the region should have, as their over-the-border neighbours, the guys on the other side, or nobody. In this case we simply say that

  int periods[PROCESS_DIMENSIONS] = {0, 0};
which means that we don't want any wrapping.
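
If we wanted a fully wrapped, torus-like grid instead, we would set both entries of periods to 1. This is only an illustration, not something our program does:

  int periods[PROCESS_DIMENSIONS] = {1, 1};   /* wrap around in both dimensions */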

The last parameter in this group, reorder, specifies whether processes should be reordered for efficiency. This implies that processes may be moved around between processors, or just renumbered, so as to better utilize any hardware architecture that a given machine may have. Examples of such machines are the Cray T3E, Cray X1, Fujitsu AP-3000, and NEC Cenju-3.
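
The value of reorder used by the program is not shown in this section. It is just an integer flag, so a plausible declaration, if you wanted to allow reordering, would be:

  int reorder = 1;   /* non-zero: MPI may renumber the processes */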

If you have asked for a $4\times3$ Cartesian communicator and started with, say, 15 processes, then 3 of those 15 processes will have to be rejected from the new communicator, because there is only going to be enough space in it for 12. The rejected processes will find that, after MPI_Cart_create returns, the value of cartesian_communicator is MPI_COMM_NULL. What this means is that those processes didn't get the job.

They may hang about, if they choose, or they may go right for MPI_Finalize - this is up to the programmer. You may create more than one communicator with various properties, so that the processes that didn't make it into the cartesian_communicator will get jobs elsewhere.
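
One way of doing that, sketched here only to illustrate the idea (the variable names are made up and none of this is in our program), is to split MPI_COMM_WORLD so that the processes which received MPI_COMM_NULL end up with a small communicator of their own, within which they can be given other work:

  MPI_Comm leftover_communicator;
  int in_the_grid = (cartesian_communicator != MPI_COMM_NULL);

  /* colour 1: processes that made it into the Cartesian grid,
     colour 0: processes that did not */
  MPI_Comm_split ( MPI_COMM_WORLD, in_the_grid, my_rank,
                   &leftover_communicator );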

For this reason the remainder of the program is in the form of one large if statement:

   if (cartesian_communicator != MPI_COMM_NULL) {
      blah... blah... blah...
   }

   MPI_Finalize ();
   exit (0);
}

Now, assuming that you, as a process, did get a job with the cartesian_communicator, the first thing you do is find your new rank number within that new communicator:

      MPI_Comm_rank ( cartesian_communicator, &my_cartesian_rank );
And the reason for this is that your rank number in the cartesian_communicator may be entirely different from your old number in the MPI_COMM_WORLD communicator. They're like a Tax File Number in Australia and a Social Security Number in the US.
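
If you ever need to translate a rank from one communicator into the corresponding rank in the other explicitly, MPI provides a function for that too. The following sketch, which is not part of our program, finds out what the world master's rank is within the cartesian_communicator:

  MPI_Group world_group, cartesian_group;
  int world_master = MASTER_RANK, cartesian_master;

  MPI_Comm_group ( MPI_COMM_WORLD, &world_group );
  MPI_Comm_group ( cartesian_communicator, &cartesian_group );
  MPI_Group_translate_ranks ( world_group, 1, &world_master,
                              cartesian_group, &cartesian_master );
  /* cartesian_master now holds the master's rank in the Cartesian
     communicator, or MPI_UNDEFINED if the master didn't make it in */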

But now, since the cartesian_communicator is a communicator with a topology, you can make inquiries about your neighbourhood:

    MPI_Cart_coords ( cartesian_communicator, my_cartesian_rank,
                      PROCESS_DIMENSIONS, my_coordinates );
    MPI_Cart_shift ( cartesian_communicator, SIDEWAYS, RIGHT, 
                     &left_neighbour, &right_neighbour );
    MPI_Cart_shift ( cartesian_communicator, UPDOWN, UP,
                     &bottom_neighbour, &top_neighbour );
The call to MPI_Cart_coords returns your Cartesian coordinates in an array my_coordinates of length PROCESS_DIMENSIONS. Naturally you have to tell MPI_Cart_coords who you are in order to obtain that information: you pass your new Cartesian rank number in the second slot of the function.
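
The mapping works the other way too: MPI_Cart_rank converts Cartesian coordinates back into a rank within the Cartesian communicator. Our program doesn't use it, but a quick sanity check could look like this:

    int rank_again;

    MPI_Cart_rank ( cartesian_communicator, my_coordinates, &rank_again );
    /* rank_again should now equal my_cartesian_rank */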

The following two calls to MPI_Cart_shift tell you about the rank numbers of your left and right neighbours and of your bottom and top neighbours. So the function MPI_Cart_shift doesn't really shift anything. It's more like looking up who your neighbours are. But the designers of MPI thought of it in terms of shifting rank numbers of your neighbours from the left and from the right and from above and from below into those containers, which you call left_neighbour, right_neighbour, bottom_neighbour, and top_neighbour.
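
If you sit on the border of an open-ended grid, some of your neighbours simply don't exist. For those directions MPI_Cart_shift returns the special rank MPI_PROC_NULL, and a send or receive addressed to MPI_PROC_NULL is quietly ignored, which is why the border processes will not need any special treatment later on. You could test for it explicitly, although our program doesn't bother:

    if (left_neighbour == MPI_PROC_NULL)
      printf ("process %2d sits on the left edge of the grid\n",
              my_cartesian_rank);

The -1 entries you will see in the output below are, by all appearances, just how this particular MPICH installation represents MPI_PROC_NULL; the standard does not prescribe the value.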

This stuff is worth displaying, so what we do now is to pack all that information into a message and send it to the master process.

    if (! i_am_the_master ) {
      sprintf(char_buffer, "process %2d, cartesian %2d, \
coords (%2d,%2d), left %2d, right %2d, top %2d, bottom %2d",
              my_rank, my_cartesian_rank, my_coordinates[0],
              my_coordinates[1], left_neighbour, right_neighbour,
              top_neighbour, bottom_neighbour);
      MPI_Send(char_buffer, strlen(char_buffer) + 1, MPI_CHAR,
               MASTER_RANK, 3003, MPI_COMM_WORLD);
    } 
    else {

      int number_of_c_procs, count;

      number_of_c_procs = divisions[0] * divisions[1]; 
      for (count = 0; count < number_of_c_procs - 1; count++) {
        MPI_Recv(char_buffer, BUFSIZ, MPI_CHAR, MPI_ANY_SOURCE, 3003,
                 MPI_COMM_WORLD, &status);
        printf ("%s\n", char_buffer);
      }
      printf( "process %2d, cartesian %2d, \
coords (%2d,%2d), left %2d, right %2d, top %2d, bottom %2d\n",
              my_rank, my_cartesian_rank, my_coordinates[0],
              my_coordinates[1], left_neighbour, right_neighbour,
              top_neighbour, bottom_neighbour);
    }
Observe that this communication takes place within MPI_COMM_WORLD, not within the cartesian_communicator. The old communicator didn't go away when the new one was created, so it can still be used for various things.

Observe a subtle bug here: we are assuming that the master process is still going to hang around! In other words, that the master process was included in the cartesian_communicator.

What will happen if it is not included?

In our case we are going to run this job requesting exactly the right number of processes, so the master process is guaranteed to be included in the cartesian_communicator. Having said that, you may consider modifying the logic of the program so as to get rid of the bug.
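
One way of getting rid of it, sketched below rather than taken from the actual code, is to choose the collector by its rank within the cartesian_communicator, the way the program already does for the matrix collection further down, so that the choice can never point at a process that was left out of the grid:

      int cartesian_master = 0;   /* a collector chosen inside the grid */

      if (my_cartesian_rank != cartesian_master)
        MPI_Send(char_buffer, strlen(char_buffer) + 1, MPI_CHAR,
                 cartesian_master, 3003, cartesian_communicator);
      else {
        /* receive number_of_c_procs - 1 reports over the
           cartesian_communicator and print them, as before */
      }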

Here is what the output looks like, once the master process gets down to it:

process  1, cartesian  1, coords ( 0, 1), left  0, right  2, top  4, bottom -1
process 10, cartesian 10, coords ( 3, 1), left  9, right 11, top -1, bottom  7
process 11, cartesian 11, coords ( 3, 2), left 10, right -1, top -1, bottom  8
process  9, cartesian  9, coords ( 3, 0), left -1, right 10, top -1, bottom  6
process  8, cartesian  8, coords ( 2, 2), left  7, right -1, top 11, bottom  5
process  7, cartesian  7, coords ( 2, 1), left  6, right  8, top 10, bottom  4
process  6, cartesian  6, coords ( 2, 0), left -1, right  7, top  9, bottom  3
process  5, cartesian  5, coords ( 1, 2), left  4, right -1, top  8, bottom  2
process  2, cartesian  2, coords ( 0, 2), left  1, right -1, top  5, bottom -1
process  3, cartesian  3, coords ( 1, 0), left -1, right  4, top  6, bottom  0
process  4, cartesian  4, coords ( 1, 1), left  3, right  5, top  7, bottom  1
process  0, cartesian  0, coords ( 0, 0), left -1, right  1, top  3, bottom -1

Observe that the cartesian rank numbers here are the same as the world rank numbers. But you must not count on it. It just happens to be so here, on the AVIDD, for this particular program, and under this particular version of MPICH2. This is not an MPI requirement.

What happens next is this simple piece of code:

    for ( i = 0; i < ROWS; i++ ) {
      for ( j = 0; j < COLUMNS; j++ ) {
        matrix [i][j] = my_cartesian_rank;
      }
    }
Every process simply initializes its own little matrix with its own cartesian rank number. The whole matrix is initialized, including the parts along its borders that are going to be dedicated to mirroring.

Once the matrices have been initialized, the processes send them to the master process, who diligently collects them and displays them on standard output:

    if (my_cartesian_rank != MASTER_RANK ) 
      MPI_Send ( matrix, COLUMNS * ROWS, MPI_INT, MASTER_RANK, 6006,
                 cartesian_communicator );
    else 
      collect_matrices ( cartesian_communicator, my_cartesian_rank,
                         matrix, 6006 );
Function collect_matrices hides the tedium of collecting, arranging and displaying data received from other nodes.
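
The function is defined elsewhere in the program. A rough sketch of what a function of this kind has to do, not necessarily matching the author's implementation line for line, might look as follows:

  void collect_matrices ( MPI_Comm communicator, int my_rank,
                          int matrix[ROWS][COLUMNS], int tag )
  {
    int pool_size, source, buffer[ROWS][COLUMNS];
    MPI_Status status;

    MPI_Comm_size ( communicator, &pool_size );
    for (source = 0; source < pool_size; source++) {
      if (source == my_rank) continue;          /* my own matrix is local */
      MPI_Recv ( buffer, ROWS * COLUMNS, MPI_INT, source, tag,
                 communicator, &status );
      /* ... slot buffer into the right place in the global picture ... */
    }
    /* ... print the assembled picture on standard output ... */
  }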

This time the communication takes place entirely within the cartesian_communicator, and the process that collects all matrices does not have to be the same process as the master process in the MPI_COMM_WORLD communicator.

Now the processes in the cartesian_communicator have to exchange information with their neighbours:

    MPI_Sendrecv ( &matrix[ROWS - 2][0], COLUMNS, MPI_INT, 
                   top_neighbour, 4004,
                   &matrix[0][0], COLUMNS, MPI_INT, bottom_neighbour, 
                   4004, 
                   cartesian_communicator, &status );

    MPI_Sendrecv ( &matrix[1][0], COLUMNS, MPI_INT, bottom_neighbour,
                   5005,
                   &matrix[ROWS - 1][0], COLUMNS, MPI_INT, 
                   top_neighbour, 5005,
                   cartesian_communicator, &status );
While sending this stuff up and down with the MPI_Sendrecv operation, we're making use of the fact that matrices in C are stored in row-major fashion, i.e., the other way around from Fortran. The first statement sends the whole second row from the top (I have numbered these matrices upside down, so as to stick to the usual $x\times y$ convention) to the top neighbour. The send buffer begins at &matrix[ROWS - 2][0] and is COLUMNS items long. The items are of type MPI_INT and the corresponding message is going to have tag 4004.

At the same time, every process receives into its bottom row data sent by its bottom neighbour. The beginning of the receive buffer is &matrix[0][0], the buffer contains items of type MPI_INT, and the number of those items is COLUMNS.

Of course this communication must take place within the cartesian_communicator, because it is only within this communicator that the notion of a bottom and top neighbour makes sense.

The second MPI_Sendrecv operation does the same, but in the opposite direction, i.e., the data is now sent down, whereas previously it went up.
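
To keep the picture straight, this is the layout of each local matrix after both exchanges, in the upside-down numbering used above (an illustration only, not code from the program):

    /*
       matrix[ROWS - 1][*]   <-  ghost row, received from the top neighbour
       matrix[ROWS - 2][*]   ->  sent up to the top neighbour
           ...                   interior rows owned by this process
       matrix[1][*]          ->  sent down to the bottom neighbour
       matrix[0][*]          <-  ghost row, received from the bottom neighbour
    */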

Having done that we display the state of the whole system again:

    if (my_cartesian_rank != MASTER_RANK ) 
      MPI_Send ( matrix, COLUMNS * ROWS, MPI_INT, MASTER_RANK, 6006,
                 cartesian_communicator );
    else 
      collect_matrices ( cartesian_communicator, my_cartesian_rank,
                         matrix, 6006 );

The last part of the code is a little tricky. Here we have to exchange columns between neighbours, but columns are not stored contiguously in C.

MPI provides a very powerful apparatus for situations like this one.

First we define a strided data type:

    MPI_Type_vector (ROWS, 1, COLUMNS, MPI_INT, &column_type);
    MPI_Type_commit (&column_type);
Function MPI_Type_vector constructs this special strided data type as follows: the data type comprises ROWS items of type MPI_INT. The items are collected by picking up 1 item every COLUMNS steps from contiguous storage. The description of this peculiar data type is placed in column_type. Once the type has been defined, it must be committed by calling MPI_Type_commit. Here we give the parallel machine an opportunity to adjust its hardware or software logic in order to prepare for manipulating this new data type.
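
For completeness, a committed type can be released once it is no longer needed, although in a short program like this one nothing terrible happens if you leave the cleanup to MPI_Finalize. The call below is not in the listing above:

    /* a column j of the matrix occupies matrix[0][j], matrix[1][j], ...,
       i.e., one MPI_INT every COLUMNS ints -- exactly what column_type
       describes */
    MPI_Type_free (&column_type);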

Now we can exchange columns between neighbours in much the same way as we did rows before:

    MPI_Sendrecv ( &matrix[0][1], 1, column_type, left_neighbour, 7007,
                   &matrix[0][COLUMNS - 1], 1, column_type, 
                   right_neighbour, 7007,
                   cartesian_communicator, &status );
 
    MPI_Sendrecv ( &matrix[0][COLUMNS - 2], 1, column_type, 
                   right_neighbour, 8008,
                   &matrix[0][0], 1, column_type, left_neighbour, 8008,
                   cartesian_communicator, &status );
The first operation sends one item of type column_type, stored at &matrix[0][1], to the left_neighbour. The corresponding message has tag 7007. At the same time another single item of column_type is received from the right_neighbour and placed at &matrix[0][COLUMNS - 1]. Again, observe that we are not sending the border columns: we are sending internal columns, and we're receiving into the border columns. The second MPI_Sendrecv call does the same in the opposite direction, i.e., from left to right.

After the data has been exchanged, the state of the system is displayed again by calling:

    if (my_cartesian_rank != MASTER_RANK )
      MPI_Send ( matrix, COLUMNS * ROWS, MPI_INT, MASTER_RANK, 9009,
                 cartesian_communicator );
    else
      collect_matrices ( cartesian_communicator, my_cartesian_rank,
                         matrix, 9009 );
So, this is what the program does.

In spite of MPI's powerful auxiliary functions, this is still messy, clumsy and, worst of all, error prone, compared to data-parallel environments such as High Performance Fortran, which do it all automatically. The IBM and PGI HPF compilers, as a matter of fact, compile your HPF program to MPI code that does very much what this simple example illustrates. Using HPF in place of MPI, wherever applicable, will save you a lot of hard work.

HPF has recently been installed on the AVIDD cluster and it lives in /opt/pgi/linux86/bin.

