next up previous index
Next: MPE Graphics Up: MPI Graphics and Process Previous: MPI Graphics and Process

Instrumenting MPI Programs with MPE

You have seen a simple and quite effective example of program instrumentation  in section 5.2.4 that talked about the Bank Queue paradigm. There we instrumented the program by writing logs of various operations on log files, one per each process.

The MPE library provides a much more sophisticated tools for logging various events and states in your program. These don't have to be MPI events or states only. You can use MPE logging for other events and states too. The logs are created on the run and stored in memory, so as to minimize the impact that logging may have on the program's performance. Only after the logging and the program have been completed, the actual log file is written.

Unlike our simple logging example in section 5.2.4, MPE logging is accompanied by very precise time stamps. This can help solve some intricate communication and synchronization problems as well as optimize communications and computations within the program. So MPE logging can be used not only to debug a parallel program, but also to profile its execution.

MPE logging  can be ``by default'', in which case just about everything that happens in the program is logged, or it can be customized. The latter is more useful for a programmer, the former is simpler, because the programmer doesn't have to do anything. All that's required is to link the program with the -llmpe library.

Suppose you would like to log all MPI events generated by program cpi.c discussed in section 5.2.3. This program comes from the MPICH2 examples that are distributed together with the code. To generate the logging version of the program compile it as follows:

gustav@bh1 $ pwd
/N/B/gustav/src/I590/cpi
gustav@bh1 $ mpicc -o cpi cpi.c -llmpe -lmpe
gustav@bh1 $
When the program is run, it should generate a file called cpi.clog in  your working directory:
gustav@bh1 $ pwd
/N/B/gustav
gustav@bh1 $ mpdtrace | wc
     27      27     134
gustav@bh1 $ mpiexec -n 8 cpi
Process 0 of 8 is on bh1
Process 1 of 8 is on bc30
Process 2 of 8 is on bc23
Process 6 of 8 is on bc37
Process 4 of 8 is on bc36
Process 3 of 8 is on bc35
Process 5 of 8 is on bc34
Process 7 of 8 is on bc27
pi is approximately 3.1415926544231247, Error is 0.0000000008333316
wall clock time = 0.042839
Writing logfile.
Finished writing logfile.
gustav@bh1 $ ls *log*
cpi.clog
gustav@bh1 $
This file is written in the ``native'' format and is not readable without auxiliary tools, but it can be converted to the human readable alog format  by calling the  clog2alog utility:
gustav@bh1 $ clog2alog cpi
gustav@bh1 $ ls *cpi*
cpi.alog  cpi.clog
gustav@bh1 $
You don't need to supply the .clog suffix, because the program adds it automatically to the file basename you provide it with on the command line. The alog file contains an ASCI head followed by an ASCI listing of all events and states logged by the program. This is what it may look like:
gustav@bh1 $ cat cpi.alog
-1 0 0 0 0 0 Me
-2 0 0 35 0 0
-3 0 0 8 0 0
-4 0 0 1 0 0
-5 0 0 5 0 0
-6 0 0 0 0 0
-7 0 0 0 0 42890
-8 0 0 1 0 0
-11 0 0 0 0 0
-13 0 13 14 0 0 cyan:boxes BCAST
-13 0 25 26 0 0 purple:2x2 REDUCE
-201 0 0 -1 0 0 MPI_PROC_NULL
-201 0 0 -2 0 0 MPI_ANY_SOURCE
-201 0 0 -1 0 1 MPI_ANY_TAG
13 0 0 1 0 55 
14 0 0 2 0 83 
25 0 0 1 0 108 
13 1 0 1 0 4088 
14 1 0 2 0 4114 
25 1 0 1 0 4138 
13 6 0 1 0 7042 
13 2 0 1 0 7290 
14 2 0 2 0 7327 
25 2 0 1 0 7351 
13 4 0 1 0 8855 
14 4 0 2 0 8938 
25 4 0 1 0 8966 
14 6 0 2 0 9108 
25 6 0 1 0 9137 
13 3 0 1 0 15425 
14 3 0 2 0 15466 
25 3 0 1 0 15491 
13 5 0 1 0 15973 
14 5 0 2 0 16034 
25 5 0 1 0 16060 
13 7 0 1 0 16155 
14 7 0 2 0 16207 
25 7 0 1 0 16231 
26 1 0 2 0 25906 
26 3 0 2 0 41456 
26 2 0 2 0 41716 
26 7 0 2 0 41867 
26 6 0 2 0 42135 
26 5 0 2 0 42372 
26 4 0 2 0 42565 
26 0 0 2 0 42890 
gustav@bh1 $
Even though the format of the file is human readable, the file is still rather cryptic and not very helpful.

A better way is to convert the clog file to an  slog2 file with the  clogTOslog2 utility (here you have to type the full name of the file including its extension):

gustav@bh1 $ clogTOslog2 cpi.clog
GUI_LIBDIR is set. GUI_LIBDIR = \
         /N/B/gustav/src/mpich2-0.94b1/src/mpe/slog2sdk/lib
         SLOG-2 Header:
version = SLOG 2.0.5
NumOfChildrenPerNode = 2
TreeLeafByteSize = 65536
MaxTreeDepth = 0
MaxBufferByteSize = 574
Categories  is FBinfo(94 @ 682)
MethodDefs  is FBinfo(0 @ 0)
LineIDMaps  is FBinfo(100 @ 776)
TreeRoot    is FBinfo(574 @ 108)
TreeDir     is FBinfo(38 @ 876)
Annotations is FBinfo(0 @ 0)
Postamble   is FBinfo(0 @ 0)

Number of Drawables = 16
Number of Unmatched Events = 0
Total ByteSize of the logfile = 3688
timeElapsed between 1 & 2 = 873 msec
timeElapsed between 2 & 3 = 50 msec
gustav@bh1 $ ls *log*
cpi.alog  cpi.clog  cpi.slog2
gustav@bh1 $
and then run program  jumpshot on the slog2 file.

Jumpshot is a rather large (and so heavy that it is quite unusable over an ISDN line) Java program that reads the slog2 file and displays events and states of the program in the form of multiple long horizontal ``thermometers'', each ``thermometer'' corresponding to a separate process. Message passing is symbolized by arrows that connect various ``thermometers''. This way you can see exactly when a message has been sent and when it has been received. Various states of the program are symbolized by various colours of the ``thermometer'' lines. For example, computation may be symbolized by the ``thermometer'' going red, whereas waiting for data may be symbolized by, say, black or gray. The idea behind an efficient parallel program is to have as much red and as little black as possible. The window of jumpshot that displays all this is called  the Timeline Window.

A long program, which involves a lot of communication may produce a very dense picture in the Timeline Window that is going to be difficult to read at first glance. You can use a magnifying glass gizmo, invoked from the bottom right corner of the window, to stretch fragments of the log, so that you can see what happens there in greater detail. There is also another magnifying glass in the tool bar above the window, which can be used to select specific portions of the log for a closer look.

In order to customize the logging, you have to instrument your program explicitly. Here is the example of cpi.c, obtained from the MPE component of the MPICH2 source, instrumented this way:

gustav@bh1 $ pwd
/N/B/gustav/src/I590/mpe
gustav@bh1 $ cat cpilog.c
#include <mpi.h>
#include <mpe.h>
#include <math.h>
#include <stdio.h>

double f( double );
double f( double a)
{
    return (4.0 / (1.0 + a*a));
}

int main( int argc, char *argv[])
{
    int  n, myid, numprocs, i, j;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x;
    double startwtime = 0.0, endwtime;
    int namelen; 
    int event1a, event1b, event2a, event2b,
        event3a, event3b, event4a, event4b;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc,&argv);
    MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD,&myid);

    MPI_Get_processor_name(processor_name,&namelen);
    fprintf(stderr,"Process %d running on %s\n", myid, processor_name);

    MPE_Init_log();

    /*  Get event ID from MPE, user should NOT assign event ID  */
    event1a = MPE_Log_get_event_number(); 
    event1b = MPE_Log_get_event_number(); 
    event2a = MPE_Log_get_event_number(); 
    event2b = MPE_Log_get_event_number(); 
    event3a = MPE_Log_get_event_number(); 
    event3b = MPE_Log_get_event_number(); 
    event4a = MPE_Log_get_event_number(); 
    event4b = MPE_Log_get_event_number(); 

    if (myid == 0) {
	MPE_Describe_state(event1a, event1b, "Broadcast", "red");
	MPE_Describe_state(event2a, event2b, "Compute",   "blue");
	MPE_Describe_state(event3a, event3b, "Reduce",    "green");
	MPE_Describe_state(event4a, event4b, "Sync",      "orange");
    }

    if (myid == 0) 
    {
	n = 1000000;
	startwtime = MPI_Wtime();
    }
    MPI_Barrier(MPI_COMM_WORLD);

    MPE_Start_log();

    for (j = 0; j < 5; j++)
    {
	MPE_Log_event(event1a, 0, "start broadcast");
	MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
	MPE_Log_event(event1b, 0, "end broadcast");
    
	MPE_Log_event(event4a,0,"Start Sync");
	MPI_Barrier(MPI_COMM_WORLD);
	MPE_Log_event(event4b,0,"End Sync");

	MPE_Log_event(event2a, 0, "start compute");
	h   = 1.0 / (double) n;
	sum = 0.0;
	for (i = myid + 1; i <= n; i += numprocs)
	{
	    x = h * ((double)i - 0.5);
	    sum += f(x);
	}
	mypi = h * sum;
	MPE_Log_event(event2b, 0, "end compute");

	MPE_Log_event(event3a, 0, "start reduce");
	MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
	MPE_Log_event(event3b, 0, "end reduce");
    }

    MPE_Finish_log("cpilog");

    if (myid == 0)
    {
	endwtime = MPI_Wtime();
	printf("pi is approximately %.16f, Error is %.16f\n",
	       pi, fabs(pi - PI25DT));
	printf("wall clock time = %f\n", endwtime-startwtime);
    }
    MPI_Finalize();
    return(0);
}
gustav@bh1 $
The program should be linked with the -lmpe library only as follows:
gustav@bh1 $ mpicc -o cpilog cpilog.c -lmpe
gustav@bh1 $
Then, once it's run, it'll generate the log file on cpilog.clog, which can be then processed with clogTOslog2 and viewed with jumpshot.

Let us have a closer look at this instrumented program.

The call to MPE_Init_log()  initiates the MPE logging subsystem. This call has to be issued explicitly only if you are going to link the program with the -lmpe library. If you link the program with the -llmpe -lmpe libraries, the function is called by the -llmpe library right at the beginning of the program implicitly, but then everything is going to be logged, including the new stuff you have requested in your code and all the defaults as well. The advantage of starting and stopping the logging within your program explicitly is that you can decide to log only a specific part of the program or even to log various parts of it on various files.

Once the logging has been initialized we can define various events:

    event1a = MPE_Log_get_event_number(); 
    event1b = MPE_Log_get_event_number(); 
    event2a = MPE_Log_get_event_number(); 
    event2b = MPE_Log_get_event_number(); 
    event3a = MPE_Log_get_event_number(); 
    event3b = MPE_Log_get_event_number(); 
    event4a = MPE_Log_get_event_number(); 
    event4b = MPE_Log_get_event_number();
These calls to MPE_Log_get_event_number  merely assign dynamically generated integer numbers to variables event1a through event4b. You could assign the numbers to these events manually and this is how it's been done in the past (see, e.g., ``Using MPI''), but this is not a recommended practice today. At this stage we have only specified that we are going to have eight different events, but we haven't told MPE what these events are going to be yet.

Next we see the definition of states issued by the means of calls to function MPE_Describe_state :

    if (myid == 0) {
        MPE_Describe_state(event1a, event1b, "Broadcast", "red");
        MPE_Describe_state(event2a, event2b, "Compute",   "blue");
        MPE_Describe_state(event3a, event3b, "Reduce",    "green");
        MPE_Describe_state(event4a, event4b, "Sync",      "orange");
    }
These definitions still don't define what the events themselves are, but by now the programmer has already made her choices. Events event1a and event1b correspond to the beginning and the end of broadcast. When the log is written and then read and processed by jumpshot, the duration of a broadcast will be symbolized by a horizontal red line in the Timeline Window, and the legend attached to the display will say that the red colour corresponds to a broadcast. Similarly, computation will be symbolized by blue, the MPI_Reduce  will be symbolized by green and the duration of MPI_Barrier  will be symbolized by orange. So, in this case a lot of blue means ``good'', and a lot of other colours meand ``bad''.

The state definitions need not be specified by more than just one process. All processes will log their events independently, but it is enough for MPE to read state definitions from one process only. These will then be applied to all other processes.

The log itself is started by the call to MPE_Start_log()  and it is only now that we specify what the events event1a through event4b really mean, for example:

        MPE_Log_event(event1a, 0, "start broadcast");
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        MPE_Log_event(event1b, 0, "end broadcast");
Function MPE_Log_event  logs the event even1a labeled "start broadcast" just before calling MPI_Bcast, and the logs the event event1b with the label "end broadcast" just after function MPI_Bcast has returned. Everything between these two events will be marked with the red colour on the jumpshot display, as has been specified with the calls to MPE_Describe_state.

Of course, the programmer may well get it all wrong, which is exactly where automatic logging is advantageous.

The call to MPI_Barrier is similarly flanked with logged events:

        MPE_Log_event(event4a,0,"Start Sync");
        MPI_Barrier(MPI_COMM_WORLD);
        MPE_Log_event(event4b,0,"End Sync");
and then we flank the whole computation section with events event2a and event2b:
        MPE_Log_event(event2a, 0, "start compute");
        h   = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs)
        {
            x = h * ((double)i - 0.5);
            sum += f(x);
        }
        mypi = h * sum;
        MPE_Log_event(event2b, 0, "end compute");
Finally we flank the call to MPI_Reduce with event3a and event3b:
        MPE_Log_event(event3a, 0, "start reduce");
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        MPE_Log_event(event3b, 0, "end reduce");
The call to
    MPE_Finish_log("cpilog");
closes  the log and writes it on the file cpilog.clog. This last call has to be issued explicitly only if you link the program with -lmpe only. If you were to link it with -llmpe -lmpe instead, the file would be generated automatically and its name would default to the name of the binary.

To summarize let me list the synopsis for each MPE log function discussed in this section:

int MPE_Init_log(void);
int MPE_Start_log(void);
int MPE_Stop_log(void);
int MPE_Finish_log(char *logfilename);
int MPE_Describe_state(int start, int end, char *name, char *color);
int MPE_Describe_event(int event, char *name);
int MPE_Log_event(int event, int intdata, char *chardata);
int MPE_Log_get_event_number(void);
 


next up previous index
Next: MPE Graphics Up: MPI Graphics and Process Previous: MPI Graphics and Process
Zdzislaw Meglicki
2004-04-29