You have seen a simple and quite effective example of program instrumentation in section 5.2.4 that talked about the Bank Queue paradigm. There we instrumented the program by writing logs of various operations on log files, one per each process.
The MPE library provides a much more sophisticated tools for logging various events and states in your program. These don't have to be MPI events or states only. You can use MPE logging for other events and states too. The logs are created on the run and stored in memory, so as to minimize the impact that logging may have on the program's performance. Only after the logging and the program have been completed, the actual log file is written.
Unlike our simple logging example in section 5.2.4, MPE logging is accompanied by very precise time stamps. This can help solve some intricate communication and synchronization problems as well as optimize communications and computations within the program. So MPE logging can be used not only to debug a parallel program, but also to profile its execution.
MPE logging
can be ``by default'', in which case just about everything
that happens in the program is logged, or it can be customized. The
latter is more useful for a programmer, the former is simpler, because
the programmer doesn't have to do anything. All that's required is
to link the program with the -llmpe library.
Suppose you would like to log all MPI events generated by program
cpi.c discussed in section 5.2.3. This program comes
from the MPICH2 examples that are distributed together with the code.
To generate the logging version of the program compile it
as follows:
gustav@bh1 $ pwd /N/B/gustav/src/I590/cpi gustav@bh1 $ mpicc -o cpi cpi.c -llmpe -lmpe gustav@bh1 $When the program is run, it should generate a file called
cpi.clog in
your working directory:
gustav@bh1 $ pwd
/N/B/gustav
gustav@bh1 $ mpdtrace | wc
27 27 134
gustav@bh1 $ mpiexec -n 8 cpi
Process 0 of 8 is on bh1
Process 1 of 8 is on bc30
Process 2 of 8 is on bc23
Process 6 of 8 is on bc37
Process 4 of 8 is on bc36
Process 3 of 8 is on bc35
Process 5 of 8 is on bc34
Process 7 of 8 is on bc27
pi is approximately 3.1415926544231247, Error is 0.0000000008333316
wall clock time = 0.042839
Writing logfile.
Finished writing logfile.
gustav@bh1 $ ls *log*
cpi.clog
gustav@bh1 $
This file is
written in the ``native'' format and is not readable without
auxiliary tools,
but it can be converted to the human readable alog
format by
calling the clog2alog utility:gustav@bh1 $ clog2alog cpi gustav@bh1 $ ls *cpi* cpi.alog cpi.clog gustav@bh1 $You don't need to supply the
.clog suffix, because the program
adds it automatically to the file basename you provide it with
on the command line. The alog file contains an ASCI head
followed by an ASCI listing of all events and states logged by
the program. This is what it may look like:gustav@bh1 $ cat cpi.alog -1 0 0 0 0 0 Me -2 0 0 35 0 0 -3 0 0 8 0 0 -4 0 0 1 0 0 -5 0 0 5 0 0 -6 0 0 0 0 0 -7 0 0 0 0 42890 -8 0 0 1 0 0 -11 0 0 0 0 0 -13 0 13 14 0 0 cyan:boxes BCAST -13 0 25 26 0 0 purple:2x2 REDUCE -201 0 0 -1 0 0 MPI_PROC_NULL -201 0 0 -2 0 0 MPI_ANY_SOURCE -201 0 0 -1 0 1 MPI_ANY_TAG 13 0 0 1 0 55 14 0 0 2 0 83 25 0 0 1 0 108 13 1 0 1 0 4088 14 1 0 2 0 4114 25 1 0 1 0 4138 13 6 0 1 0 7042 13 2 0 1 0 7290 14 2 0 2 0 7327 25 2 0 1 0 7351 13 4 0 1 0 8855 14 4 0 2 0 8938 25 4 0 1 0 8966 14 6 0 2 0 9108 25 6 0 1 0 9137 13 3 0 1 0 15425 14 3 0 2 0 15466 25 3 0 1 0 15491 13 5 0 1 0 15973 14 5 0 2 0 16034 25 5 0 1 0 16060 13 7 0 1 0 16155 14 7 0 2 0 16207 25 7 0 1 0 16231 26 1 0 2 0 25906 26 3 0 2 0 41456 26 2 0 2 0 41716 26 7 0 2 0 41867 26 6 0 2 0 42135 26 5 0 2 0 42372 26 4 0 2 0 42565 26 0 0 2 0 42890 gustav@bh1 $Even though the format of the file is human readable, the file is still rather cryptic and not very helpful.
A better way is to convert the clog file to
an slog2 file
with the clogTOslog2 utility (here you
have to type the full name of the file including its extension):
gustav@bh1 $ clogTOslog2 cpi.clog
GUI_LIBDIR is set. GUI_LIBDIR = \
/N/B/gustav/src/mpich2-0.94b1/src/mpe/slog2sdk/lib
SLOG-2 Header:
version = SLOG 2.0.5
NumOfChildrenPerNode = 2
TreeLeafByteSize = 65536
MaxTreeDepth = 0
MaxBufferByteSize = 574
Categories is FBinfo(94 @ 682)
MethodDefs is FBinfo(0 @ 0)
LineIDMaps is FBinfo(100 @ 776)
TreeRoot is FBinfo(574 @ 108)
TreeDir is FBinfo(38 @ 876)
Annotations is FBinfo(0 @ 0)
Postamble is FBinfo(0 @ 0)
Number of Drawables = 16
Number of Unmatched Events = 0
Total ByteSize of the logfile = 3688
timeElapsed between 1 & 2 = 873 msec
timeElapsed between 2 & 3 = 50 msec
gustav@bh1 $ ls *log*
cpi.alog cpi.clog cpi.slog2
gustav@bh1 $
and then run program
jumpshot
on the slog2 file.
Jumpshot is a rather large (and so heavy that it is quite unusable
over an ISDN line)
Java program that reads the slog2 file
and displays events and states of the program in the form of multiple
long horizontal ``thermometers'', each ``thermometer'' corresponding
to a separate process. Message passing is symbolized by arrows that
connect various ``thermometers''. This way you can see exactly when
a message has been sent and when it has been received. Various
states of the program are symbolized by various colours of the ``thermometer''
lines. For example, computation may be symbolized by the ``thermometer''
going red, whereas waiting for data may be symbolized by, say, black or
gray. The idea behind an efficient parallel program is to have
as much red and as little black as possible. The window of jumpshot
that displays all this is called
the
Timeline Window.
A long program, which involves a lot of communication may produce a very dense picture in the Timeline Window that is going to be difficult to read at first glance. You can use a magnifying glass gizmo, invoked from the bottom right corner of the window, to stretch fragments of the log, so that you can see what happens there in greater detail. There is also another magnifying glass in the tool bar above the window, which can be used to select specific portions of the log for a closer look.
In order to customize the logging, you have to instrument your
program explicitly. Here is the example of cpi.c, obtained
from the MPE component of the MPICH2 source, instrumented
this way:
gustav@bh1 $ pwd
/N/B/gustav/src/I590/mpe
gustav@bh1 $ cat cpilog.c
#include <mpi.h>
#include <mpe.h>
#include <math.h>
#include <stdio.h>
double f( double );
double f( double a)
{
return (4.0 / (1.0 + a*a));
}
int main( int argc, char *argv[])
{
int n, myid, numprocs, i, j;
double PI25DT = 3.141592653589793238462643;
double mypi, pi, h, sum, x;
double startwtime = 0.0, endwtime;
int namelen;
int event1a, event1b, event2a, event2b,
event3a, event3b, event4a, event4b;
char processor_name[MPI_MAX_PROCESSOR_NAME];
MPI_Init(&argc,&argv);
MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
MPI_Comm_rank(MPI_COMM_WORLD,&myid);
MPI_Get_processor_name(processor_name,&namelen);
fprintf(stderr,"Process %d running on %s\n", myid, processor_name);
MPE_Init_log();
/* Get event ID from MPE, user should NOT assign event ID */
event1a = MPE_Log_get_event_number();
event1b = MPE_Log_get_event_number();
event2a = MPE_Log_get_event_number();
event2b = MPE_Log_get_event_number();
event3a = MPE_Log_get_event_number();
event3b = MPE_Log_get_event_number();
event4a = MPE_Log_get_event_number();
event4b = MPE_Log_get_event_number();
if (myid == 0) {
MPE_Describe_state(event1a, event1b, "Broadcast", "red");
MPE_Describe_state(event2a, event2b, "Compute", "blue");
MPE_Describe_state(event3a, event3b, "Reduce", "green");
MPE_Describe_state(event4a, event4b, "Sync", "orange");
}
if (myid == 0)
{
n = 1000000;
startwtime = MPI_Wtime();
}
MPI_Barrier(MPI_COMM_WORLD);
MPE_Start_log();
for (j = 0; j < 5; j++)
{
MPE_Log_event(event1a, 0, "start broadcast");
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPE_Log_event(event1b, 0, "end broadcast");
MPE_Log_event(event4a,0,"Start Sync");
MPI_Barrier(MPI_COMM_WORLD);
MPE_Log_event(event4b,0,"End Sync");
MPE_Log_event(event2a, 0, "start compute");
h = 1.0 / (double) n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs)
{
x = h * ((double)i - 0.5);
sum += f(x);
}
mypi = h * sum;
MPE_Log_event(event2b, 0, "end compute");
MPE_Log_event(event3a, 0, "start reduce");
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
MPE_Log_event(event3b, 0, "end reduce");
}
MPE_Finish_log("cpilog");
if (myid == 0)
{
endwtime = MPI_Wtime();
printf("pi is approximately %.16f, Error is %.16f\n",
pi, fabs(pi - PI25DT));
printf("wall clock time = %f\n", endwtime-startwtime);
}
MPI_Finalize();
return(0);
}
gustav@bh1 $
The program
should be linked with the -lmpe library only as follows:gustav@bh1 $ mpicc -o cpilog cpilog.c -lmpe gustav@bh1 $Then, once it's run, it'll generate the log file on
cpilog.clog,
which can be then processed with clogTOslog2 and viewed with
jumpshot.
Let us have a closer look at this instrumented program.
The call to MPE_Init_log()
initiates the MPE logging subsystem. This call has to be issued
explicitly only if you are going to link the program with
the -lmpe library. If you link the program with the
-llmpe -lmpe libraries, the function is called by the -llmpe
library
right at the beginning of the program implicitly, but then everything
is going to be logged, including the new stuff you have requested
in your code and all the
defaults as well. The advantage
of starting and stopping the logging within your program explicitly
is that you can decide to log only a specific part of the program
or even to log various parts of it on various files.
Once the logging has been initialized we can define various events:
event1a = MPE_Log_get_event_number();
event1b = MPE_Log_get_event_number();
event2a = MPE_Log_get_event_number();
event2b = MPE_Log_get_event_number();
event3a = MPE_Log_get_event_number();
event3b = MPE_Log_get_event_number();
event4a = MPE_Log_get_event_number();
event4b = MPE_Log_get_event_number();
These calls to
MPE_Log_get_event_number
merely assign dynamically generated
integer numbers to variables event1a
through event4b. You could assign the numbers to these events
manually and this is how it's been done in the past (see, e.g., ``Using MPI''),
but this is not a recommended practice today. At this stage
we have only specified that we are going to have eight different
events, but we haven't told MPE what these events are going to
be yet.
Next we see the definition of states issued by the means
of calls to function MPE_Describe_state :
if (myid == 0) {
MPE_Describe_state(event1a, event1b, "Broadcast", "red");
MPE_Describe_state(event2a, event2b, "Compute", "blue");
MPE_Describe_state(event3a, event3b, "Reduce", "green");
MPE_Describe_state(event4a, event4b, "Sync", "orange");
}
These definitions still don't define what the events themselves are,
but by now the programmer has already made her choices. Events
event1a and event1b correspond to the beginning and
the end of broadcast. When the log is written and then read and processed
by jumpshot, the duration of a broadcast will be symbolized
by a horizontal red line in the Timeline Window, and the
legend attached to the display will say that the red colour corresponds
to a broadcast. Similarly, computation will be symbolized by blue,
the MPI_Reduce will be symbolized by
green and the duration of MPI_Barrier
will be symbolized by orange. So, in this case a lot of blue means
``good'', and a lot of other colours meand ``bad''.
The state definitions need not be specified by more than just one process. All processes will log their events independently, but it is enough for MPE to read state definitions from one process only. These will then be applied to all other processes.
The log itself is started by the call to
MPE_Start_log() and it is only now
that we specify what the events event1a through event4b really mean,
for example:
MPE_Log_event(event1a, 0, "start broadcast");
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
MPE_Log_event(event1b, 0, "end broadcast");
Function MPE_Log_event logs the event
even1a labeled "start broadcast" just before calling MPI_Bcast,
and the logs the event event1b with the label "end broadcast" just
after function MPI_Bcast has returned. Everything between these
two events will be marked with the red colour on the jumpshot display,
as has been specified with the calls to MPE_Describe_state.
Of course, the programmer may well get it all wrong, which is exactly where automatic logging is advantageous.
The call to MPI_Barrier is similarly flanked with logged events:
MPE_Log_event(event4a,0,"Start Sync");
MPI_Barrier(MPI_COMM_WORLD);
MPE_Log_event(event4b,0,"End Sync");
and then we flank the whole computation section with events event2a and
event2b:
MPE_Log_event(event2a, 0, "start compute");
h = 1.0 / (double) n;
sum = 0.0;
for (i = myid + 1; i <= n; i += numprocs)
{
x = h * ((double)i - 0.5);
sum += f(x);
}
mypi = h * sum;
MPE_Log_event(event2b, 0, "end compute");
Finally we flank the call to MPI_Reduce with event3a and event3b:
MPE_Log_event(event3a, 0, "start reduce");
MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
MPE_Log_event(event3b, 0, "end reduce");
The call to
MPE_Finish_log("cpilog");
closes the log and writes it on
the file cpilog.clog. This last call has to be issued explicitly
only if you link the program with -lmpe only. If you were to link it
with -llmpe -lmpe instead, the file would be generated automatically
and its name would default to the name of the binary.
To summarize let me list the synopsis for each MPE log function discussed in this section:
int MPE_Init_log(void); int MPE_Start_log(void); int MPE_Stop_log(void); int MPE_Finish_log(char *logfilename); int MPE_Describe_state(int start, int end, char *name, char *color); int MPE_Describe_event(int event, char *name); int MPE_Log_event(int event, int intdata, char *chardata); int MPE_Log_get_event_number(void);