
Deadlocks

Consider the following program:

#include <stdio.h>
#include <mpi.h>

#define PIXEL_WIDTH 50
#define PIXEL_HEIGHT 50

int First_Line = 0;
int Last_Line = 0;

void compute_pixels( int taskid, int numtask );
void collect_pixels( int taskid, int numtask );

int main(int argc, char *argv[])
{
  int numtask;
  int taskid;

  MPI_Init( &argc, &argv);
  MPI_Comm_size( MPI_COMM_WORLD, &numtask );
  MPI_Comm_rank( MPI_COMM_WORLD, &taskid );

  /* Task 0 collects the pixels; every other task computes and sends them. */
  if ( taskid == 0 )
    collect_pixels( taskid, numtask );
  else
    compute_pixels( taskid, numtask );

  printf( "Task %d waiting to complete.\n", taskid );

  MPI_Barrier( MPI_COMM_WORLD );
  printf( "Task %d complete.\n", taskid );
  MPI_Finalize();
  return 0;
}

/* Compute task: send one (row, col) message to task 0 for every pixel
   in this task's band of rows. */
void compute_pixels( int taskid, int numtask )
{
  int section;
  int row, col;
  int pixel_data[2];

  printf( "Compute #%d: checking in\n", taskid );

  /* Number of rows handled by each compute task. */
  section = PIXEL_HEIGHT / ( numtask - 1 );

  First_Line = ( taskid - 1 ) * section;
  Last_Line = taskid * section;

  for ( row = First_Line; row < Last_Line; row++ )
    for ( col = 0; col < PIXEL_WIDTH; col++ )
      {
        pixel_data[0] = row;
        pixel_data[1] = col;
        MPI_Send( pixel_data, 2, MPI_INT, 0, 0, MPI_COMM_WORLD );
      }

  printf( "Compute #%d: done sending.\n", taskid );
  return;
}

/* Control task: receive one message per pixel, from any compute task. */
void collect_pixels( int taskid, int numtask )
{
  int pixel_data[2];
  MPI_Status stat;
  int mx = PIXEL_HEIGHT * PIXEL_WIDTH;

  printf( "Control #%d: No. of nodes used is %d\n", taskid, numtask );
  printf( "Control: expect to receive %d messages\n", mx );

  while ( mx > 0 )
    {
      MPI_Recv( pixel_data, 2, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                MPI_COMM_WORLD, &stat );
      mx--;
    }

  printf( "Control node #%d: done receiving.\n", taskid );
  return;
}

Compile this program with:

gustav@sp20:../MPI_hangs 09:28:22 !505 $ mpcc -g -o rtrace_bug rtrace_bug.c
gustav@sp20:../MPI_hangs 09:28:39 !506 $

And run it as follows:

<:28:39 !506 $ rtrace_bug -procs 4 -labelio yes -tracelevel 9
  0:Control #0: No. of nodes used is 4
  0:Control: expect to receive 2500 messages
  3:Compute #3: checking in
  2:Compute #2: checking in
  1:Compute #1: checking in
  2:Compute #2: done sending.
  2:Task 2 waiting to complete.
  3:Compute #3: done sending.
  3:Task 3 waiting to complete.
  1:Compute #1: done sending.
  1:Task 1 waiting to complete.
^CERROR: 0031-250  task 3: Interrupt
ERROR: 0031-250  task 0: Interrupt
ERROR: 0031-250  task 1: Interrupt
ERROR: 0031-250  task 2: Interrupt
gustav@sp20:../MPI_hangs 09:32:05 !507 $

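This program hangs. The three compute tasks report that they are done sending and reach the barrier, but task 0 never reports that it is done receiving. When the output stops, interrupt the run with ^C. The -procs 4 option starts four tasks, and the -labelio yes option makes poe prefix every line of output with the id of the task that wrote it, so that you can clearly see who writes what. The -tracelevel 9 option turns on tracing at the most detailed level and produces the trace file examined below.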

Now invoke vt on the trace with the command:

gustav@sp20:../MPI_hangs 09:35:07 !508 $ vt -tracefile rtrace_bug.trc &
[1] 8302
gustav@sp20:../MPI_hangs 09:35:47 !509 $

If you get complaints about lack of colours, close down vt, then close down netscape or any other "colour hog" that may be running on your display, and restart vt.

Use vt to check what happens towards the end of the trace: open the Interprocessor Communication window and the Source Code window. Observe that after some initial messaging activity the program hangs in the following calls:

task 0
    MPI_Recv in collect_pixels
other tasks
    MPI_Barrier in main

Use the magnifying glass to stretch the horizontal bars in the Interprocessor Communication window. Left-click on the bars towards the end of the run to see the communication operations on which the program hangs.

Identify the cause for the hang and fix the program.


Zdzislaw Meglicki
2001-02-26