
Using the PE Debugger

Restart the same program, rtrace_bug, but without the tracefile option this time:

gustav@sp20:../MPI_hangs 12:22:06 !518 $ rtrace_bug -procs 4 -labelio yes
  0:Control #0: No. of nodes used is 4
  0:Control: expect to receive 2500 messages
  1:Compute #1: checking in
  3:Compute #3: checking in
  2:Compute #2: checking in
  2:Compute #2: done sending.
  2:Task 2 waiting to complete.
  3:Compute #3: done sending.
  3:Task 3 waiting to complete.
  1:Compute #1: done sending.
  1:Task 1 waiting to complete.
When the program hangs, in another window type:
<22:06 !504 $ ps -u gustav | grep poe | grep -v grep
    43098 34200  pts/1  0:00 poe
gustav@sp20:../MPI_hangs 13:09:01 !505 $
This gives us the POE process ID, which in this case is 34200 (43098 is my UID).

Now, in the same window, attach the pedb debugger to the POE process:

gustav@sp20:../MPI_hangs 13:18:50 !507 $ pedb -a 34200
pedb Version 2, Release 3 -- Oct 13 1998 21:56:50
Warning: Cannot convert string "Rom10.500" to type FontStruct
A window will pop up listing all four tasks and their PIDs on their respective nodes.

Press the Attach All button. The original window will go away, and a very large multi-panelled window will fill the whole display. The Stack panel shows stack listings for all participating processes. You'll see that they all hang in internal MPI function calls, which have no line numbers. But as you go down each stack you eventually find function calls that do carry line numbers within the program's own code, e.g., task 0 should flag:

collect_pixels(), line 68
whereas the other tasks should flag:
main(), line 25
Double click on the line collect_pixels() in the task 0 stack listing: the code should now appear in the large window on the left with the offending line, in this case
MPI_Recv( pixel_data, 2, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, ...
Go to the Global Data panel (it may be hidden, in which case you will need to stretch it a little so that it shows its window and push buttons) and right-click on the Task 0 push button. A small menu will pop up; select Show All. Repeat this for all the other tasks.

Look at the local data values. Observe that for task 0, mx is 100, which means that task 0 thinks it is still going to receive 100 more messages.

You can look at the other tasks the same way, and you'll find that they're all stuck waiting at the barrier.

The problem is therefore diagnosed. Task 0 expected to receive 2500 messages but received only 2400, so it is still blocked in MPI_Recv while the other tasks wait for it at the barrier.

Zdzislaw Meglicki