Running MPI
on the AVIDD Clusters is a little problematic because
there are several MPI versions and implementations there.
What is of special interest to us is MPI-2, because MPI-IO is
integral to it. For this reason we have installed a beta version
of Argonne's MPICH2 in /N/hpc/mpich2.
This directory should be mounted on all computational nodes and head nodes of
both IUPUI and IUB clusters.
You will also need to define an environmental variable MPD_USE_USER_CONSOLE and, possibly, LD_RUN_PATH and LD_LIBRARY_PATH (because there are dynamic run-time libraries in /N/hpc/mpich2/lib). Without MPD_USE_USER_CONSOLE defined in your environment the MPICH2 execution engine is going to do some rather weird things with the name of the socket file with the sad effect that correct communication within the engine will not get established and the engine will shut down.

But first and foremost you must create a file .mpd.conf in your home directory. This file must be readable by you only and you must be the only person allowed to write on it too:

    $ cd
    $ touch .mpd.conf
    $ chmod 600 .mpd.conf

Now you must enter the following line on this file:

    password=yoopee

Replace the word ``yoopee'' with your favourite password, of course. Do not use your AVIDD or your IU Net password. This password is for the MPICH2 MPD system only.

I forgot to tell you about this in the laboratory class and that is why MPICH2 worked for me only. Then I forgot about it altogether, and ended up very perplexed and suspected MPICH2 of the most horrible bugs imaginable. MPICH2 is a beta release at this stage, to be sure, so surprises are possible, but it's not this bad.
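
Since a missing or world-readable .mpd.conf is easy to overlook, a small check may save some perplexity. The following is only a sketch: the function name check_mpd_conf is made up for this illustration and is not part of MPICH2, and it assumes GNU stat (the -c option), as found on Linux.

```shell
# Hypothetical helper, not part of MPICH2: verify that an mpd
# configuration file is private and contains a password line.
# Assumes GNU stat (the -c option), as found on Linux.
check_mpd_conf() {
    conf=$1
    if [ ! -f "$conf" ]; then
        echo "$conf: no such file" >&2
        return 1
    fi
    # Permissions must be exactly 600: read and write for the owner only.
    mode=$(stat -c %a "$conf")
    if [ "$mode" != "600" ]; then
        echo "$conf: mode is $mode, expected 600" >&2
        return 1
    fi
    # The file must contain a non-empty password=... line.
    if ! grep -q '^password=.' "$conf"; then
        echo "$conf: no password= line" >&2
        return 1
    fi
    return 0
}
```

You would call it as check_mpd_conf $HOME/.mpd.conf and look at the exit status or at the message it prints.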
The easiest way to get your PATH, MANPATH,
LD_LIBRARY_PATH, LD_RUN_PATH and
MPD_USE_USER_CONSOLE right is
to copy .bashrc, .bash_profile
and .inputrc files from my home directory to your home.
Proceed as follows. After you have logged on, issue the commands:

    $ cd
    $ cp .bashrc .bashrc.ORIG
    $ cp .bash_profile .bash_profile.ORIG
    $ cp .inputrc .inputrc.ORIG
    $ cp ~gustav/.bashrc .bashrc
    $ cp ~gustav/.bash_profile .bash_profile
    $ cp ~gustav/.inputrc .inputrc
    $ chmod 755 .bashrc .bash_profile
    $ chmod 644 .inputrc

Having done this, log out and log in again.
If you know what you are doing and if you prefer to use shells other
than bash, have a look at these files and then set up your environment
similarly. The most important thing is to ensure that
$HOME/bin is in front of the command search
path, so that you can override system commands with your own, and that
/N/hpc/mpich2/bin is
the second directory in your command search path, so that you'll get
MPI-2 start up commands, as well as Python-2.3 and other tools
used by MPI-2, in place of whatever may be currently installed
on the system. The system-wide version of Python, for example,
is older and doesn't work with MPICH2.
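
If you set up your environment by hand, the relevant lines might look like the sketch below. It is reconstructed from the requirements just listed, not copied from my actual .bashrc, and it assumes that your private directories are $HOME/bin, $HOME/lib and $HOME/man.

```shell
# Sketch of the relevant .bashrc settings; not the author's actual file.
# $HOME/lib and $HOME/man are assumed private directories.
MPICH2=/N/hpc/mpich2

# $HOME/bin first, MPICH2 second, then whatever was there before.
PATH=$HOME/bin:$MPICH2/bin:$PATH
MANPATH=$HOME/man:$MPICH2/man:$MPICH2/share/man${MANPATH:+:$MANPATH}

# The MPICH2 run-time libraries live in /N/hpc/mpich2/lib.
LD_LIBRARY_PATH=$HOME/lib:$MPICH2/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}
LD_RUN_PATH=$HOME/lib:$MPICH2/lib${LD_RUN_PATH:+:$LD_RUN_PATH}

# Needed by the MPICH2 execution engine.
MPD_USE_USER_CONSOLE=yes

export PATH MANPATH LD_LIBRARY_PATH LD_RUN_PATH MPD_USE_USER_CONSOLE
```

The ${VAR:+:$VAR} form appends the old value only if there was one, so that you don't end up with a stray colon at the end of the path.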
To check that everything is as it ought to be try the following commands:

    gustav@bh1 $ cd
    gustav@bh1 $ ls -l .mpd.conf
    -rw------- 1 gustav ucs 16 Oct 2 18:58 .mpd.conf
    gustav@bh1 $ cat .mpd.conf
    password=frabjous
    gustav@bh1 $ env | grep PATH
    LD_RUN_PATH=/N/B/gustav/lib:/N/hpc/mpich2/lib
    LD_LIBRARY_PATH=/N/B/gustav/lib:/N/hpc/mpich2/lib
    MANPATH=/N/B/gustav/man:/N/B/gustav/share/man:/N/hpc/mpich2/man:\
    /N/hpc/mpich2/share/man:/usr/local/man:/usr/local/share/man:\
    /usr/man:/usr/share/man:/usr/X11R6/man
    PATH=/N/B/gustav/bin:/N/hpc/mpich2/bin:/usr/local/bin:/bin:\
    /usr/bin:/usr/X11R6/bin:/usr/pbs/bin:/usr/local/hpss:.
    gustav@bh1 $ env | grep MPD
    MPD_USE_USER_CONSOLE=yes
    gustav@bh1 $

The first directory in
LD_RUN_PATH, LD_LIBRARY_PATH,
MANPATH and PATH, of course, should be replaced with
your private bin, lib and man.
Now you should run the following command on the IUB cluster

    gustav@bh1 $ for i in `cat ~gustav/.bcnodes`
    > do
    > echo -n "$i: "
    > ssh $i date
    > done
    bc01-myri0: Wed Oct 1 16:12:56 EST 2003
    bc02-myri0: Wed Oct 1 16:12:56 EST 2003
    bc03-myri0: Wed Oct 1 16:12:56 EST 2003
    bc04-myri0: Wed Oct 1 16:12:57 EST 2003
    ...
    bc93-myri0: Wed Oct 1 16:13:39 EST 2003
    bc94-myri0: Wed Oct 1 16:13:39 EST 2003
    bc95-myri0: Wed Oct 1 16:13:39 EST 2003
    bc96-myri0: Wed Oct 1 16:13:40 EST 2003
    gustav@bh1 $

and similarly on the IUPUI cluster:

    gustav@ih1 $ for i in `cat ~gustav/.icnodes`
    > do
    > echo -n "$i: "
    > ssh $i date
    > done
    ic01-myri0: Wed Oct 1 16:15:14 EST 2003
    ic02-myri0: Wed Oct 1 16:15:14 EST 2003
    ic03-myri0: Wed Oct 1 16:15:15 EST 2003
    ic04-myri0: Wed Oct 1 16:15:15 EST 2003
    ...
    ic93-myri0: Wed Oct 1 16:18:24 EST 2003
    ic94-myri0: Wed Oct 1 16:18:24 EST 2003
    ic95-myri0: Wed Oct 1 16:18:24 EST 2003
    ic97-myri0: Wed Oct 1 16:18:25 EST 2003

Please let me know if these commands hang on any of the nodes. The files
~gustav/.bcnodes and ~gustav/.icnodes
contain the lists of currently functional computational nodes
with working Myrinet interfaces
both at IUPUI and IUB. These lists may change every now and
then, in which case we may have to repeat this procedure.
The purpose of this procedure is to populate your
~/.ssh/known_hosts file with the computational nodes' keys. The
ssh command inserts the key automatically in your
known_hosts if it is not there. But in the process it writes
a message on standard output that may confuse the MPICH2
startup scripts.
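
If you would rather not sit and watch for a hanging node, the loop can be wrapped in a function that gives up on a node that does not accept the connection. This is a sketch only. The function name probe_nodes and the SSH override variable are made up for the illustration, and the ConnectTimeout option assumes an OpenSSH client recent enough to support it. The timeout covers only the connection attempt, so a node that accepts the connection and then stalls will still hang.

```shell
# Sketch: visit every node listed in a file, as in the loops above, but
# give up on a node that does not accept the connection.
# Assumes an OpenSSH client that supports the ConnectTimeout option.
# SSH may be overridden, e.g. for a dry run; it defaults to ssh.
probe_nodes() {
    nodefile=$1
    failed=0
    while read -r node; do
        [ -n "$node" ] || continue          # skip empty lines
        printf '%s: ' "$node"
        # stdin comes from /dev/null so that ssh does not swallow
        # the rest of the node list.
        if ! ${SSH:-ssh} -o ConnectTimeout=10 "$node" date < /dev/null; then
            echo "NO ANSWER"
            failed=$((failed + 1))
        fi
    done < "$nodefile"
    # Succeed only if every node answered.
    [ "$failed" -eq 0 ]
}
```

On the IUB cluster you would call it as probe_nodes ~gustav/.bcnodes, and on the IUPUI cluster as probe_nodes ~gustav/.icnodes.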
Now you are almost ready to run your first MPI job. First copy two more files from my home directory. Do it as follows:

    $ cd
    $ [ -d bin ] || mkdir bin
    $ [ -d PBS ] || mkdir PBS
    $ cp ~gustav/bin/hellow2 bin
    $ chmod 755 bin/hellow2
    $ cp ~gustav/PBS/mpi.sh PBS
    $ chmod 755 PBS/mpi.sh

Now submit the job to PBS as follows:

    $ cd ~/PBS
    $ qsub mpi.sh
    $ qstat | grep `whoami`
    21303.bh1 mpi gustav 0 R bg
    $ !!
    21303.bh1 mpi gustav 0 R bg
    $

After you have submitted the job, you can monitor its progress through the PBS system with

    $ qstat | grep `whoami`

every now and then. But the job should run quickly, unless the system is very busy. If everything works as it ought to, you will find
mpi_err and mpi_out in your working directory
after the job completes. The first file should be empty and the second
will contain the output of the job, which should be similar to:

    $ cat mpi_err
    $ cat mpi_out
    Local MPD console on bc68
    bc68_33575
    bc47_34123
    bc46_34056
    bc49_34697
    bc48_33551
    bc53_34095
    bc54_35385
    bc55_34714
    time for 100 loops = 0.124682068825 seconds
    0: bc68
    2: bc46
    1: bc47
    3: bc49
    4: bc48
    6: bc54
    5: bc53
    7: bc55
    bc68: hello world from process 0 of 8
    bc47: hello world from process 1 of 8
    bc46: hello world from process 2 of 8
    bc49: hello world from process 3 of 8
    bc48: hello world from process 4 of 8
    bc53: hello world from process 5 of 8
    bc54: hello world from process 6 of 8
    bc55: hello world from process 7 of 8
    $

This should work on both AVIDD clusters.
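
Instead of eyeballing the two files you can check them with a few lines of shell. The sketch below uses a made-up function name, check_mpi_job, and relies only on the greeting text that hellow2 prints in the output shown above.

```shell
# Sketch: check the two files left behind by the job.
# Arguments: error file, output file, expected number of processes.
check_mpi_job() {
    err=$1
    out=$2
    nprocs=$3
    # -s is true for a non-empty file, so this catches any error output.
    if [ -s "$err" ]; then
        echo "$err is not empty" >&2
        return 1
    fi
    # Every process should have printed one greeting.
    greetings=$(grep -c 'hello world from process' "$out" 2>/dev/null)
    if [ "${greetings:-0}" -ne "$nprocs" ]; then
        echo "expected $nprocs greetings, found ${greetings:-0}" >&2
        return 1
    fi
    return 0
}
```

For the job above the call would be check_mpi_job mpi_err mpi_out 8, issued in the directory where the two files have appeared.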
Let us have a look at the PBS script:

    gustav@bh1 $ cat mpi.sh
    #PBS -S /bin/bash
    #PBS -N mpi
    #PBS -o mpi_out
    #PBS -e mpi_err
    #PBS -q bg
    #PBS -m a
    #PBS -V
    #PBS -l nodes=8
    NODES=8
    HOST=`hostname`
    echo Local MPD console on $HOST
    # Specify Myrinet interfaces on the hostfile.
    grep -v $HOST $PBS_NODEFILE | sed 's/$/-myri0/' > $HOME/mpd.hosts
    # Boot the MPI2 engine.
    mpdboot --totalnum=$NODES --file=$HOME/mpd.hosts
    sleep 10
    # Inspect if all MPI nodes have been activated.
    mpdtrace -l
    # Check the connectivity.
    mpdringtest 100
    # Check if you can run trivial non-MPI jobs.
    mpdrun -l -n $NODES hostname
    # Execute your MPI program.
    mpiexec -n $NODES hellow2
    # Shut down the MPI2 engine and exit the PBS script.
    mpdallexit
    exit 0
    gustav@bh1 $

There is a new PBS directive here, which we haven't encountered yet. The option
-l lets
you specify the list of resources
required for the job. In this case there is only one item in the
list, nodes=8, and this item states that you need eight nodes
from the PBS in order to run the job. PBS returns the names
of the allocated nodes in a file whose name is passed in the environmental
variable
PBS_NODEFILE. The nodes are listed in this file one
per line.
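
Because the file has one node per line, the number of nodes can be counted rather than repeated by hand in NODES=8. The following is a sketch under that one-name-per-line assumption; the function name count_nodes is made up for the illustration.

```shell
# Sketch: count the nodes PBS has allocated instead of repeating the
# number from the "#PBS -l nodes=..." line. Assumes one node name per
# line in the file.
count_nodes() {
    # Count non-empty lines only; print the number on standard output.
    grep -c . "$1"
}

# Inside a PBS script this would be used as:
#   NODES=$(count_nodes "$PBS_NODEFILE")
```

This way the script keeps working if you change the number in the -l nodes=... line and forget about the variable.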
Also observe that we have used the -V directive. By doing
this we have imported
all environmental variables, including the
wretched MPD_USE_USER_CONSOLE,
which is essential for
MPICH2.
The first thing we do in the script is to convert the node names
to their Myrinet equivalents. This is easy to do, because
the Myrinet names are obtained by appending -myri0 to
the name of the node, returned on $PBS_NODEFILE. This is
what the first command in the script does:

    grep -v $HOST $PBS_NODEFILE | sed 's/$/-myri0/' > $HOME/mpd.hosts

There is one complication here though. We are removing from this list the name of the node on which the script runs with the command
grep -v $HOST. This is because MPICH2 is going to create
a process on this host anyway. If we left this host's name on the file,
MPICH2 would create two processes on it.
Once the names returned on $PBS_NODEFILE have been converted
to the Myrinet names, we save them on $HOME/mpd.hosts.
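
To see what this step does without submitting a job, you can try it on a small made-up node list. The sketch below wraps the pipeline in a function, myrinet_hosts, whose name is invented for the illustration. It departs from the script in one detail: it asks grep to match whole lines literally (-x -F), so that removing a host called bc6 would not also remove bc68.

```shell
# Sketch of the hostfile step on made-up data (node names are examples).
# Arguments: name of the local host, file with one node name per line.
# Prints the Myrinet names of all nodes except the local one.
myrinet_hosts() {
    # -x matches whole lines only and -F takes the name literally;
    # this is a small departure from the plain grep -v in the script.
    grep -v -x -F -- "$1" "$2" | sed 's/$/-myri0/'
}

nodefile=$(mktemp)
printf '%s\n' bc68 bc47 bc46 > "$nodefile"
myrinet_hosts bc68 "$nodefile"      # prints bc47-myri0 and bc46-myri0
rm -f "$nodefile"
```

In the PBS script the equivalent call would be myrinet_hosts $HOST $PBS_NODEFILE > $HOME/mpd.hosts.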
Now we are ready to start the MPICH2
engine. The command that does
this is

    mpdboot --totalnum=$NODES --file=$HOME/mpd.hosts

Program
mpdboot
is a Python-2.3 script that boots
the MPICH2 engine by spawning MPICH2
supervisory processes, called
mpds (pronounced ``em-pea-dee-s''), on nodes specified
on $HOME/mpd.hosts.
We have to give mpdboot a few seconds to complete its job.
We do this by telling the script to sleep for 10 seconds.
The mpds are spawned by ssh, which is why we had to get
all the keys in place in the first place. Silly problems may show up
if the keys are not there.
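
A fixed ten second pause is a guess. An alternative is to poll until the engine reports the expected number of mpds. The sketch below is generic shell: wait_for_lines is a made-up name, and using it with mpdtrace assumes that mpdtrace prints one line per running mpd, as in the output shown earlier, and that it is safe to call while the ring is still forming. I have not verified the latter.

```shell
# Sketch: poll a command until it prints the expected number of lines,
# or give up after a timeout. Arguments: expected line count, timeout
# in seconds, then the command to run.
wait_for_lines() {
    want=$1
    timeout=$2
    shift 2
    waited=0
    while :; do
        # Count non-empty output lines; errors from the command are ignored.
        have=$("$@" 2>/dev/null | grep -c . || true)
        [ "$have" -eq "$want" ] && return 0
        [ "$waited" -ge "$timeout" ] && return 1
        sleep 1
        waited=$((waited + 1))
    done
}

# In the PBS script this might replace "sleep 10":
#   wait_for_lines $NODES 30 mpdtrace || exit 1
```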
The command

    mpdtrace -l

inspects the MPICH2 engine and lists names of all nodes on which mpds are running. With the
-l option, it also lists the
names of the sockets used by the mpds to communicate with each other.
The next command

    mpdringtest 100

times a simple message going around the ring of mpds, in this case 100 times.
These two commands, mpdtrace and mpdringtest, tell
us that the MPICH2 engine is ready. We can now execute programs on
it. These don't have to be MPICH2 programs though. You can
execute any UNIX program under the MPICH2 engine. But if they are
not MPICH2 programs, they will not communicate with each other.
You will just get a number of independent instantiations of those
programs running on individual nodes. The script demonstrates this
by running the UNIX command hostname under the MPICH2 engine:

    mpdrun -l -n $NODES hostname

The option
-l asks
mpdrun to attach process labels
to any output that the instantiations of hostname on MPICH2
nodes may produce.
At long last we commence the execution of a real MPICH2 program.
The program's name is hellow2. It is an MPI version of
``Hello World''. This program should be picked up from your
$HOME/bin assuming that it is present in your command
search PATH. The command to run
this program under the MPICH2 engine is

    mpiexec -n $NODES hellow2

Observe that we don't have to use all nodes given to us by the PBS, but, of course, it would be silly not to, unless there is a special reason for it. Program
mpiexec is not specific
to MPICH2. MPI-2 specification says that such a program must be
provided for the execution of MPI jobs. There was no such specification
in the original MPI.
After hellow2 exits, we are done and we shut down the MPICH2
engine with
mpdallexit