Running MPI on the AVIDD Clusters

Running MPI on the AVIDD Clusters is a little problematic, because there are several MPI versions and implementations installed there. What is of special interest to us is MPI-2, because MPI-IO is integral to it. For this reason we have installed a beta version of Argonne's MPICH2 in /N/hpc/mpich2. This directory should be mounted on all computational nodes and head nodes of both the IUPUI and IUB clusters.

You will also need to define an environmental variable MPD_USE_USER_CONSOLE and, possibly, LD_RUN_PATH and LD_LIBRARY_PATH (because there are dynamic run-time libraries in /N/hpc/mpich2/lib). Without MPD_USE_USER_CONSOLE defined in your environment, the MPICH2 execution engine does some rather weird things with the name of its socket file, with the sad effect that communication within the engine never gets established and the engine shuts down.
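If you prefer to set these variables by hand, rather than copy my files as described below, the following fragment in your .bash_profile should do. This is only a sketch, assuming the installation directory given above; the $HOME/lib component is needed only if you keep private shared libraries of your own there:

# Set the variables MPICH2 needs at login time (a sketch only).
export MPD_USE_USER_CONSOLE=yes
export LD_RUN_PATH=$HOME/lib:/N/hpc/mpich2/lib
export LD_LIBRARY_PATH=$HOME/lib:/N/hpc/mpich2/lib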

But first and foremost you must create a file .mpd.conf in your home directory. This file must be readable and writable by you and by nobody else:

$ cd
$ touch .mpd.conf
$ chmod 600 .mpd.conf
Now you must enter the following line in this file:
password=yoopee
Replace the word ``yoopee'' with your favourite password, of course. Do not use your AVIDD or your IU Net password. This password is for the MPICH2 MPD system only.

I forgot to tell you about this in the laboratory class, which is why MPICH2 worked only for me. Then I forgot about it altogether, ended up very perplexed, and suspected MPICH2 of the most horrible bugs imaginable. MPICH2 is a beta release at this stage, to be sure, so surprises are possible, but it is not that bad.

The easiest way to get your PATH, MANPATH, LD_LIBRARY_PATH, LD_RUN_PATH and MPD_USE_USER_CONSOLE right is to copy the .bashrc, .bash_profile and .inputrc files from my home directory to yours.

Proceed as follows. After you have logged on, issue the commands:

$ cd 
$ cp .bashrc .bashrc.ORIG
$ cp .bash_profile .bash_profile.ORIG
$ cp .inputrc .inputrc.ORIG
$ cp ~gustav/.bashrc .bashrc
$ cp ~gustav/.bash_profile .bash_profile
$ cp ~gustav/.inputrc .inputrc
$ chmod 755 .bashrc .bash_profile
$ chmod 644 .inputrc
Having done this, log out and log in again.

If you know what you are doing and prefer to use a shell other than bash, have a look at these files and then set up your environment similarly; a bash sketch follows the list below. The most important things are to:

1.
Ensure that your $HOME/bin is at the front of the command search path, so that you can override system commands with your own.
2.
Ensure that the MPICH2 directory /N/hpc/mpich2/bin is the second directory in your command search path, so that you get the MPI-2 start-up commands, as well as Python-2.3 and the other tools used by MPI-2, in place of whatever may currently be installed on the system; the system-wide version of Python, for example, is older and does not work with MPICH2.
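For the record, here is what the corresponding fragment looks like in bash. This is only a sketch; your existing PATH and MANPATH will, of course, contain other components as well:

# Put your private bin first, then MPICH2, then whatever was there already (a sketch).
PATH=$HOME/bin:/N/hpc/mpich2/bin:$PATH
MANPATH=/N/hpc/mpich2/man:/N/hpc/mpich2/share/man:$MANPATH
export PATH MANPATH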

To check that everything is as it ought to be, try the following commands:

gustav@bh1 $ cd
gustav@bh1 $ ls -l .mpd.conf
-rw-------    1 gustav  ucs            16 Oct  2 18:58 .mpd.conf
gustav@bh1 $ cat .mpd.conf
password=frabjous
gustav@bh1 $ env | grep PATH
LD_RUN_PATH=/N/B/gustav/lib:/N/hpc/mpich2/lib
LD_LIBRARY_PATH=/N/B/gustav/lib:/N/hpc/mpich2/lib
MANPATH=/N/B/gustav/man:/N/B/gustav/share/man:/N/hpc/mpich2/man:\
/N/hpc/mpich2/share/man:/usr/local/man:/usr/local/share/man:\
/usr/man:/usr/share/man:/usr/X11R6/man
PATH=/N/B/gustav/bin:/N/hpc/mpich2/bin:/usr/local/bin:/bin:\
/usr/bin:/usr/X11R6/bin:/usr/pbs/bin:/usr/local/hpss:.
gustav@bh1 $ env | grep MPD
MPD_USE_USER_CONSOLE=yes
gustav@bh1 $
The first directory in LD_RUN_PATH, LD_LIBRARY_PATH, MANPATH and PATH (the first two directories, in the case of MANPATH) should, of course, be replaced with your own private lib, man and bin.

Now you should run the following command on the IUB cluster

gustav@bh1 $ for i in `cat ~gustav/.bcnodes`
> do
>    echo -n "$i: "
>    ssh $i date
> done
bc01-myri0: Wed Oct  1 16:12:56 EST 2003
bc02-myri0: Wed Oct  1 16:12:56 EST 2003
bc03-myri0: Wed Oct  1 16:12:56 EST 2003
bc04-myri0: Wed Oct  1 16:12:57 EST 2003
...
bc93-myri0: Wed Oct  1 16:13:39 EST 2003
bc94-myri0: Wed Oct  1 16:13:39 EST 2003
bc95-myri0: Wed Oct  1 16:13:39 EST 2003
bc96-myri0: Wed Oct  1 16:13:40 EST 2003
gustav@bh1 $
and similarly on the IUPUI cluster:
gustav@ih1 $ for i in `cat ~gustav/.icnodes`
> do 
>    echo -n "$i: "
>    ssh $i date
> done
ic01-myri0: Wed Oct  1 16:15:14 EST 2003
ic02-myri0: Wed Oct  1 16:15:14 EST 2003
ic03-myri0: Wed Oct  1 16:15:15 EST 2003
ic04-myri0: Wed Oct  1 16:15:15 EST 2003
...
ic93-myri0: Wed Oct  1 16:18:24 EST 2003
ic94-myri0: Wed Oct  1 16:18:24 EST 2003
ic95-myri0: Wed Oct  1 16:18:24 EST 2003
ic97-myri0: Wed Oct  1 16:18:25 EST 2003
Please let me know if these commands hang on any of the nodes. The files ~gustav/.bcnodes and ~gustav/.icnodes contain the lists of currently functional computational nodes with working Myrinet interfaces at IUB and IUPUI respectively. These lists may change every now and then, in which case we may have to repeat this procedure.

The purpose of this procedure is to populate your ~/.ssh/known_hosts file with the computational nodes' keys. The ssh command inserts a node's key automatically into your known_hosts if it is not there already, but in the process it prints a warning message that may confuse the MPICH2 startup scripts.
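If you would rather not run the loop interactively, a possible shortcut is to collect the keys with ssh-keyscan. This is a sketch only: it assumes that ssh-keyscan is installed on the head node and that the nodes present RSA host keys, and, unlike the loop above, it does not verify that you can actually log on to every node:

$ ssh-keyscan -t rsa `cat ~gustav/.bcnodes` >> ~/.ssh/known_hosts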

Now you are almost ready to run your first MPI job. First copy two more files from my home directory. Do it as follows:

$ cd
$ [ -d bin ] || mkdir bin
$ [ -d PBS ] || mkdir PBS
$ cp ~gustav/bin/hellow2 bin
$ chmod 755 bin/hellow2
$ cp ~gustav/PBS/mpi.sh PBS
$ chmod 755 PBS/mpi.sh
Now submit the job to PBS as follows:
$ cd ~/PBS
$ qsub mpi.sh
$ qstat | grep `whoami`
21303.bh1        mpi              gustav                  0 R bg              
$ !!
21303.bh1        mpi              gustav                  0 R bg              
$
After you have submitted the job, you can monitor its progress through the PBS system with
$ qstat | grep `whoami`
every now and then. But the job should run quickly, unless the system is very busy. If everything works as it ought to, you will find mpi_err and mpi_out in your working directory after the job completes. The first file should be empty and the second will contain the output of the job, which should be similar to:
$ cat mpi_err
$ cat mpi_out
Local MPD console on bc68
bc68_33575
bc47_34123
bc46_34056
bc49_34697
bc48_33551
bc53_34095
bc54_35385
bc55_34714
time for 100 loops = 0.124682068825 seconds
0: bc68
2: bc46
1: bc47
3: bc49
4: bc48
6: bc54
5: bc53
7: bc55
bc68: hello world from process 0 of 8
bc47: hello world from process 1 of 8
bc46: hello world from process 2 of 8
bc49: hello world from process 3 of 8
bc48: hello world from process 4 of 8
bc53: hello world from process 5 of 8
bc54: hello world from process 6 of 8
bc55: hello world from process 7 of 8
$
This should work on both AVIDD clusters.

Let us have a look at the PBS script:

gustav@bh1 $ cat mpi.sh
#PBS -S /bin/bash
#PBS -N mpi
#PBS -o mpi_out
#PBS -e mpi_err
#PBS -q bg
#PBS -m a
#PBS -V
#PBS -l nodes=8
NODES=8
HOST=`hostname`
echo Local MPD console on $HOST
# Specify Myrinet interfaces on the hostfile.
grep -v $HOST $PBS_NODEFILE | sed 's/$/-myri0/' > $HOME/mpd.hosts
# Boot the MPI2 engine.
mpdboot --totalnum=$NODES --file=$HOME/mpd.hosts 
sleep 10
# Inspect if all MPI nodes have been activated.
mpdtrace -l
# Check the connectivity.
mpdringtest 100
# Check if you can run trivial non-MPI jobs.
mpdrun -l -n $NODES hostname
# Execute your MPI program.
mpiexec -n $NODES hellow2
# Shut down the MPI2 engine and exit the PBS script.
mpdallexit
exit 0
gustav@bh1 $
There is a new PBS directive here, one we have not encountered yet. The option -l lets you specify the list of resources required by the job. In this case there is only one item in the list, nodes=8, which states that you need eight nodes from PBS in order to run the job. PBS returns the names of the allocated nodes in a file whose name is conveyed in the environmental variable PBS_NODEFILE. The nodes are listed in this file one per line.
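Incidentally, mpi.sh then sets NODES=8 by hand to match the -l nodes=8 request, so if you change the request you must remember to change NODES too. One way around this, a sketch only and not what mpi.sh actually does, is to derive the count from $PBS_NODEFILE itself:

# Count the nodes PBS gave us instead of hardcoding the number (a sketch).
NODES=`wc -l < $PBS_NODEFILE`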

Also observe that we have used the -V directive. This passes all environmental variables of the submitting session to the job, including the wretched MPD_USE_USER_CONSOLE, which is essential for MPICH2.

The first thing we do in the script is convert the node names to their Myrinet equivalents. This is easy to do, because the Myrinet names are obtained by appending -myri0 to the node names returned in $PBS_NODEFILE. This is what the first command in the script does:

grep -v $HOST $PBS_NODEFILE | sed 's/$/-myri0/' > $HOME/mpd.hosts
There is one complication here, though. With grep -v $HOST we remove from the list the name of the node on which the script itself runs. This is because MPICH2 is going to create a process on this host anyway, so if we left this host's name in the file, MPICH2 would create two processes on it.
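To make this concrete, here is what the transformation might look like for an allocation such as the one that produced the example output above. The node names are purely illustrative; your allocation will differ:

$ hostname
bc68
$ cat $PBS_NODEFILE
bc68
bc47
bc46
bc49
bc48
bc53
bc54
bc55
$ grep -v bc68 $PBS_NODEFILE | sed 's/$/-myri0/'
bc47-myri0
bc46-myri0
bc49-myri0
bc48-myri0
bc53-myri0
bc54-myri0
bc55-myri0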

Once the names returned in $PBS_NODEFILE have been converted to the Myrinet names, we save them in $HOME/mpd.hosts. Now we are ready to start the MPICH2 engine. The command that does this is

mpdboot --totalnum=$NODES --file=$HOME/mpd.hosts
Program mpdboot is a Python-2.3 script that boots the MPICH2 engine by spawning MPICH2 supervisory processes, called mpds (pronounced ``em-pea-dee-s''), on the nodes specified in $HOME/mpd.hosts.

We have to give mpdboot a few seconds to complete its job, so we tell the script to sleep for 10 seconds. The mpds are spawned by ssh, which is why we had to populate known_hosts with the nodes' keys earlier. Silly problems show up if the keys are not there.
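Should ten seconds ever prove too short, one could poll the engine instead of sleeping blindly. The following is a sketch only, not part of mpi.sh, and it assumes, as the output above suggests, that mpdtrace prints one line per running mpd:

# Wait up to 60 seconds for all $NODES mpds to join the ring (a sketch).
count=0
while [ `mpdtrace 2>/dev/null | wc -l` -lt $NODES -a $count -lt 60 ]
do
   sleep 1
   count=`expr $count + 1`
done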

The command

mpdtrace -l
inspects the MPICH2  engine and lists names of all nodes on which mpds are running. With the -l option, it also lists the names of the sockets used by the mpds to communicate with each other.

The next command

mpdringtest 100
times a simple  message going around the ring of mpds, in this case 100 times.

These two commands, mpdtrace and mpdringtest, tell us that the MPICH2 engine is ready. We can now execute programs on it. These don't have to be MPICH2 programs though. You can execute any UNIX program under the MPICH2 engine. But if they are not MPICH2 programs, they will not communicate with each other. You will just get a number of independent instantiations of those programs running on individual nodes. The script demonstrates this by running the UNIX command hostname under the MPICH2 engine:

mpdrun -l -n $NODES hostname
The option -l asks  mpdrun to attach process labels to any output that the instantiations of hostname on MPICH2 nodes may produce.

At long last we commence the execution of a real MPICH2 program. The program's name is hellow2 and it is an MPI version of ``Hello World''. The program should be picked up from your $HOME/bin, assuming that this directory is at the front of your command search PATH. The command to run it under the MPICH2 engine is

mpiexec -n $NODES hellow2
Observe that we don't have to use all the nodes given to us by PBS, although, of course, it would be silly not to, unless there is a special reason for it. Program mpiexec is not specific to MPICH2: the MPI-2 specification says that such a program must be provided for the execution of MPI jobs. There was no such requirement in the original MPI.
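For example, if for some reason you wanted to run on only four of the eight nodes, the following would work just as well, and the remaining mpds would simply sit idle. This is a sketch, of course, not something mpi.sh does:

mpiexec -n 4 hellow2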

After hellow2 exits, we are done and we shut down the MPICH2 engine with

mpdallexit



 