Greetings, Master

The "Hello World" program from section 5.2.1 ran in parallel, but the participating processes did not exchange any messages, so the parallelism was trivial.

In this section we're going to have a look at our first non-trivial parallel program in which processes are going to exchange information.

We are going to designate one of the processes the master process. The master process will find out the name of the processor it runs on and broadcast that name to all other processes. They will respond by sending greetings to the master process, which will collect all the greetings and display them on standard output.

Here is the program:

gustav@bh1 $ cat greetings.c 
/*
 * The master process broadcasts the name of the CPU it runs on to
 * the pool. All other processes respond by sending greetings
 * to the master process, which collects the messages and displays
 * them on standard output.
 *
 * %Id: greetings.c,v 1.2 2003/09/29 19:47:29 gustav Exp %
 * 
 * %Log: greetings.c,v %
 * Revision 1.2  2003/09/29 19:47:29  gustav
 * Added "The" to the comment.
 *
 * Revision 1.1  2003/09/29 19:37:58  gustav
 * Initial revision
 *
 *
 */
#include <stdio.h>    /* functions sprintf, printf and the constant BUFSIZ defined there */
#include <string.h>   /* function strcpy defined there */
#include <stdlib.h>   /* function exit defined there */
#include <mpi.h>      /* all MPI-2 functions defined there */

#define TRUE 1        
#define FALSE 0
#define MASTER_RANK 0 /* It is traditional to make process 0 the master. */

int main(int argc, char *argv[])
{
  int count, pool_size, my_rank, my_name_length, i_am_the_master = FALSE;
  char my_name[BUFSIZ], master_name[BUFSIZ], send_buffer[BUFSIZ],
    recv_buffer[BUFSIZ];
  MPI_Status status;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &pool_size);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Get_processor_name(my_name, &my_name_length);

  if (my_rank == MASTER_RANK) {
    i_am_the_master = TRUE;
    strcpy (master_name, my_name);
  }

  MPI_Bcast(master_name, BUFSIZ, MPI_CHAR, MASTER_RANK, MPI_COMM_WORLD);

  if (i_am_the_master) 
    for (count = 1; count < pool_size; count++) {
      MPI_Recv (recv_buffer, BUFSIZ, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
                MPI_COMM_WORLD, &status);
      printf ("%s\n", recv_buffer);
    }
  else {
    sprintf(send_buffer, "hello %s, greetings from %s, rank = %d",
            master_name, my_name, my_rank);
    MPI_Send (send_buffer, strlen(send_buffer) + 1, MPI_CHAR,
              MASTER_RANK, 0, MPI_COMM_WORLD);
  }

  MPI_Finalize();
   
  exit(0);
}

gustav@bh1 $
Although at first glance this program does much the same as our previous example, hellow2, it does it differently, and in a way that is immediately extensible to quite non-trivial MPI programs. It contains a number of new conceptual elements and some new MPI function calls.

It begins the same way any other MPI program begins. First we call MPI_Init, followed by MPI_Comm_size, MPI_Comm_rank, and finally MPI_Get_processor_name. We have discussed these already in section 5.2.1.

Then a new thing happens. The program forks. Only one of the MPI processes, namely the one whose rank is equal to MASTER_RANK, executes the if clause:

  if (my_rank == MASTER_RANK) {
    i_am_the_master = TRUE;
    strcpy (master_name, my_name);
  }
whereas all other processes skip this clause, because it's not theirs, and proceed to the statement that follows, which is
  MPI_Bcast(master_name, BUFSIZ, MPI_CHAR, MASTER_RANK, MPI_COMM_WORLD);
This statement blocks, i.e., the processes that get to it hang and wait until process MASTER_RANK broadcasts the information.

But the master process is not there yet. Within the if clause it sets the logical flag i_am_the_master to TRUE, whereas, by default, it's been initialized to FALSE, and so it stays FALSE for everybody else. Then the master process copies the name of the processor it runs on into a string called master_name. At this stage the master process is the only process that has anything in this string. For all other processes master_name just contains garbage (you must not assume that UNIX is so kind as to initialize all strings to nulls).

Now the master process joins the broadcast. The broadcast operation works as follows. There is a root process for the broadcast, which is given by the fourth argument to MPI_Bcast. Here it is MASTER_RANK. The first argument is the buffer. Remember that each of the MPI processes has its own buffer that is pointed to by this variable; these buffers live on different machines and in different memory locations, but they are all referred to as master_name. Only the root process, which in our case is the master, has something sensible in its own buffer. The length of the buffer, here BUFSIZ, is given as the number of items of the type specified by the third argument, in this case MPI_CHAR. Here all processes execute this broadcast statement from the same line of the code (normally it doesn't have to be so), and this guarantees that their buffers are of identical length and that they will all interpret the data that is going to be sent in the same way, i.e., as characters. The last argument specifies the MPI communicator within which the communication is going to take place.
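For reference, here is the prototype of MPI_Bcast as given by the MPI standard, with the role of each argument noted:

  int MPI_Bcast(void *buffer,          /* starting address of the buffer */
                int count,             /* number of entries in the buffer */
                MPI_Datatype datatype, /* data type of the buffer entries */
                int root,              /* rank of the broadcast root */
                MPI_Comm comm);        /* communicator */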

Now the root process broadcasts the content of its buffer to all other processes, and at the end of this operation they all have the same stuff in their buffers.

This type of operation is called  a collective operation. Collective operations tend to be rather costly, not only because data is sent across the whole pool of MPI processes (which may be very large), but also because the operation forces all processes to wait until it completes. This enforces synchronization  on the process pool, and synchronization is always very costly in terms of CPU time and even wall-clock time.
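The extreme case of such a synchronizing collective is MPI_Barrier, which transfers no data at all and exists purely to make every process in a communicator wait for all the others:

  MPI_Barrier(MPI_COMM_WORLD); /* no process returns from this call
                                  until every process has entered it */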

Anyhow, the master process has broadcast the name of its CPU to all other processes, and now the program forks again. The master process does something quite different from all other processes. This is expressed again by means of the if statement, which has one clause for the master process and another clause for all other processes:

  if (i_am_the_master) 
    for (count = 1; count < pool_size; count++) {
      MPI_Recv (recv_buffer, BUFSIZ, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
                MPI_COMM_WORLD, &status);
      printf ("%s\n", recv_buffer);
    }
  else {
    sprintf(send_buffer, "hello %s, greetings from %s, rank = %d",
            master_name, my_name, my_rank);
    MPI_Send (send_buffer, strlen(send_buffer) + 1, MPI_CHAR,
              MASTER_RANK, 0, MPI_COMM_WORLD);
  }
Now you can fully appreciate why it is so important that MPI processes have ranks. Within its own clause the master process issues pool_size - 1 receives and follows each successful receive with a print statement that prints whatever the master has received on standard output. All other processes, on the other hand, construct messages within their own part of the program and send them to the master process. In a sequential program only one clause of an if statement is executed. Here both are executed at the same time, but they run on different CPUs.

Let us have a look at the functions that send and receive data.

Function MPI_Send sends  strlen(send_buffer) + 1 items of type MPI_CHAR to the process of rank MASTER_RANK. The items are taken from the buffer pointed to by send_buffer, and the communication takes place within the MPI_COMM_WORLD communicator. The fifth argument, which is here set to 0, is the message tag. It is an arbitrary integer number that is used to differentiate between various messages. The receiving process may wish to receive messages with certain tags only.
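Again for reference, the prototype of MPI_Send is:

  int MPI_Send(void *buf,             /* starting address of the send buffer */
               int count,             /* number of entries to send */
               MPI_Datatype datatype, /* data type of each entry */
               int dest,              /* rank of the destination process */
               int tag,               /* message tag */
               MPI_Comm comm);        /* communicator */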

Now the master process is going to receive the messages into its own receive buffer, which is here called recv_buffer. This is the first  argument to function MPI_Recv. The second argument specifies the length of the receive buffer, measured as the number of items of type MPI_CHAR, which is the third argument to MPI_Recv. Observe that since we do not know a priori how long a message is going to be, we usually have to oversize the receive buffer, but this can be overcome in various ways. There are inquiry functions that can be used to find out how much data has arrived before the data is actually read. The fourth argument to MPI_Recv specifies the rank of the process from which we want to receive the message. We don't have to receive messages as they come. We can make them wait, and we can ignore them altogether too. But in this case we say that we want to receive messages from any source (MPI_ANY_SOURCE)  and with any tag  (MPI_ANY_TAG). The messages are going to be received within the MPI_COMM_WORLD communicator, and the status of the received message is going to be written to a structure called status. We will ignore this last bit for the time being.
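The inquiry functions alluded to above are MPI_Probe and MPI_Get_count. A minimal sketch of how they can be combined to learn the size of a pending message before receiving it:

  MPI_Status status;
  int incoming;
  /* block until a message is pending, but do not receive it yet */
  MPI_Probe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
  /* ask the status how many MPI_CHAR items the message carries */
  MPI_Get_count(&status, MPI_CHAR, &incoming);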

Let me show you how this is made, installed and run.

gustav@bh1 $ pwd
/N/B/gustav/src/I590/greetings
gustav@bh1 $ make
co  RCS/Makefile,v Makefile
RCS/Makefile,v  -->  Makefile
revision 1.1
done
co  RCS/greetings.c,v greetings.c
RCS/greetings.c,v  -->  greetings.c
revision 1.2
done
cc -I/N/hpc/mpich2/include -c greetings.c
cc -o greetings greetings.o -L/N/hpc/mpich2/lib -lmpich
gustav@bh1 $ make install
co  RCS/greetings.1,v greetings.1
RCS/greetings.1,v  -->  greetings.1
revision 1.1
done
[ -d /N/B/gustav/bin ] || mkdirhier /N/B/gustav/bin
install greetings /N/B/gustav/bin
[ -d /N/B/gustav/man/man1 ] || mkdirhier /N/B/gustav/man/man1
install greetings.1 /N/B/gustav/man/man1
gustav@bh1 $
The manual page looks like this:
GREETINGS(1)         I590 Programmer's Manual        GREETINGS(1)

NAME
       greetings - send greetings to the master process

SYNOPSIS
       mpiexec -n <number-of-processes> greetings

DESCRIPTION
       greetings  The  master process finds out the name of the
       CPU it runs on and broadcasts it to all.  Having received
       the broadcast, other processes construct greeting messages
       and send them back to the master, who  collects  them  and
       displays them on standard output.

OPTIONS
       No greetings specific options

DIAGNOSTICS
       No greetings specific diagnostics

EXAMPLES
       $ mpdboot -n 8
       $ mpiexec -n 8 greetings
       hello bc89, greetings from bc46, rank = 4
       hello bc89, greetings from bc42, rank = 2
       hello bc89, greetings from bc43, rank = 1
       hello bc89, greetings from bc47, rank = 5
       hello bc89, greetings from bc48, rank = 6
       hello bc89, greetings from bc44, rank = 3
       hello bc89, greetings from bc88, rank = 7
       $ mpdallexit

AUTHOR
       Still too trivial a program to claim authorship.

I590/7462                  October 2003              GREETINGS(1)
I run this program from the following PBS script:
gustav@bh1 $ pwd
/N/B/gustav/PBS
gustav@bh1 $ cat greetings.sh
#PBS -S /bin/bash
#PBS -N greetings
#PBS -o greetings_out
#PBS -e greetings_err
#PBS -q bg
#PBS -V
#PBS -m a
#PBS -l nodes=8
NODES=8
HOST=`hostname`
echo Local MPD console on $HOST
grep -v $HOST $PBS_NODEFILE | sed 's/$/-myri0/' > $HOME/mpd.hosts
mpdboot --totalnum=$NODES --file=$HOME/mpd.hosts 
sleep 10
mpiexec -n $NODES greetings
mpdallexit
exit 0
gustav@bh1 $ qsub greetings.sh
20674.bh1.avidd.iu.edu
gustav@bh1 $ qstat | grep gustav
20674.bh1        greetings        gustav                  0 R bg              
gustav@bh1 $ !!
qstat | grep gustav
gustav@bh1 $ cat greetings_err
gustav@bh1 $ cat greetings_out
Local MPD console on bc89
hello bc89, greetings from bc43, rank = 4
hello bc89, greetings from bc40, rank = 1
hello bc89, greetings from bc41, rank = 2
hello bc89, greetings from bc44, rank = 5
hello bc89, greetings from bc46, rank = 6
hello bc89, greetings from bc42, rank = 3
hello bc89, greetings from bc88, rank = 7
gustav@bh1 $
Observe that the messages do not arrive in any particular order.
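If we wanted the greetings in rank order, we could replace MPI_ANY_SOURCE in the master's receive loop with the loop index, so that the master insists on receiving from each rank in turn. This is a sketch, not part of the original program:

    for (count = 1; count < pool_size; count++) {
      /* receive from rank `count` specifically; messages from other
         ranks simply wait in the system until their turn comes */
      MPI_Recv (recv_buffer, BUFSIZ, MPI_CHAR, count, MPI_ANY_TAG,
                MPI_COMM_WORLD, &status);
      printf ("%s\n", recv_buffer);
    }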

This program is a very simple prototype of what you would have to do in order to implement non-parallel IO in the absence of MPI-IO. The processes would have to send their data to a master process, which would then collect and organize the data and write it out to a file using normal UNIX IO. Before MPI-IO became available, all MPI programs used to checkpoint this way.
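A minimal sketch of that pattern, obtained from the greetings program by replacing the master's printf with ordinary UNIX file IO. The file name checkpoint.dat is made up for illustration:

  if (i_am_the_master) {
    FILE *fp = fopen("checkpoint.dat", "w");  /* plain UNIX IO, master only */
    for (count = 1; count < pool_size; count++) {
      MPI_Recv (recv_buffer, BUFSIZ, MPI_CHAR, MPI_ANY_SOURCE, MPI_ANY_TAG,
                MPI_COMM_WORLD, &status);
      fprintf (fp, "%s\n", recv_buffer);      /* the master writes everybody's data */
    }
    fclose (fp);
  } else {
    /* every worker ships its data to the master instead of writing it itself */
    MPI_Send (send_buffer, strlen(send_buffer) + 1, MPI_CHAR,
              MASTER_RANK, 0, MPI_COMM_WORLD);
  }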



 