Sometimes you may wish to interact with your production jobs by conversing with them via a command line or an X11 interface. A simple way to arrange for this mode of execution is to ask LoadLeveler to give you an xterm window running on a node serving a given class. Once the window appears on your X11 display, you can enter whatever commands you wish and run your application interactively, the same way you would run it on one of the front-end nodes (i.e., s1n01 or s1n02).
The difference is that the node given to you by LoadLeveler will almost always be less loaded than the front-end nodes. If you request the xterm to run on a node of pool 2, you will have the whole node to yourself. On nodes of pool 1 you may have to share a node with up to 3 other jobs. Another difference is that interactive jobs running under LoadLeveler will be accounted for. Although you may not find it quite so exciting from your point of view, from our point of view the situation is diametrically the opposite, to the extent that we are prepared to kill without warning or mercy any production jobs found running on the front-end nodes - interactive or not!
In order to run xterm under LoadLeveler, first edit a LoadLeveler script which should look as follows:
# @ output = $(job_name).out
# @ error = $(job_name).err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ executable = /usr/bin/X11/xterm
# @ arguments = -ls -sb -sl 300 -n $(job_name) -T $(job_name)
# @ queue
and save it, say, on xterm.ll.
Before submitting the script to LoadLeveler, ensure that X-windows programs running on the SP will be allowed access to your X11 display. The easiest way to do that is to add an appropriate authorisation entry for your display to the .Xauthority file on the SP with a command such as
$ xauth add nazgul.qpsf.edu.au:0 MIT-MAGIC-COOKIE-1 62dd45bc706a4415357c7973386d7112
Of course, you should replace the name of the display (nazgul.qpsf.edu.au:0), the protocol name (MIT-MAGIC-COOKIE-1), and the key data (62dd45bc706a4415357c7973386d7112) with whatever is appropriate for your X11 display.
Now define the display itself and submit the job:
$ export DISPLAY=nazgul.qpsf.edu.au:0.0
$ llsubmit xterm.ll
The value of the DISPLAY will be passed on to xterm, because I have used the
# @ environment = COPY_ALL
directive.
In the same way you can run any other interactive X11 application under LoadLeveler. For example, the following LoadLeveler script will run GNU emacs
# @ output = $(job_name).out
# @ error = $(job_name).err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ executable = /opt/gnu/bin/emacs
# @ queue
Remember that you must have the environmental variable DISPLAY bound to your X11 display for this to work.
If your login shell is either csh or tcsh, LoadLeveler may have problems with cancellation of your interactive jobs, because of the way those jobs will be spawned through an intermediary csh, which won't go away. The way around this problem is to replace the script discussed above with the following:
# @ shell = /opt/gnu/bin/bash
# @ output = xterm.out
# @ error = xterm.err
# @ job_type = serial
# @ class = half_hour_dedicated
# @ notification = always
# @ environment = COPY_ALL
# @ queue
exec xterm -ls -sb -sl 300 -n `hostname` -T `hostname`
This script tells LoadLeveler explicitly to spawn the job by using exec from within an /opt/gnu/bin/bash process. The effect of this will be such that bash will be replaced with xterm, so that llcancel will send its signal to the right process this time.
Another problem that may occur sometimes, and that will occur if your login shell is csh or tcsh, is that the X11 server managing your display will not receive the correct X11 authority information (protocol and key-data) from xterm in this context. This looks like a bug. We don't have a proper fix for it yet. In that case you will have to open the server to the world by issuing the command:
$ xhost +
before submitting the job with llsubmit. Once you get the window, you can close the server again with the command:
$ xhost -
From this point onwards, information in your .Xauthority file will be correctly passed to your X11 server by other X11 applications invoked from that window.
There is a class of jobs which many users who lack UNIX skills think of as interactive, but which, in fact, aren't interactive at all. Applications which take input from a command line in some sort of application-dependent language fall in that category. Examples are Common Lisp, Smalltalk, Scheme, Matlab, Xplor, GeneHunter, etc.
Often a user has a command file prepared, which must be loaded into an application from an interactive session. Once the command file is loaded the application begins executing a program, which may take hours to complete. A user at that stage goes away leaving an active telnet connection or an X11 window on the display. The window was needed only to load the file and perhaps issue some start-up commands for the computation.
Jobs like that are not interactive at all and they can and should be run under LoadLeveler without asking for an xterm window and forking an unnecessary login shell.
The way to execute such jobs is to use the here-input feature of UNIX shells:
$ my_command << EOF
one_line_of_input
another_line_of_input
EOF
Here is an example:
# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = hello-lisp.out
# @ error = hello-lisp.err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
clisp -q << EOF
(load "hello.fas")
(hello)
EOF
Here I have a Common Lisp program stored on a file hello.fas. Normally, in order to execute that program I would have to enter a Common Lisp environment with the command clisp. Then I would have to load the file which contains the program, and finally I would have to evaluate the function, which has been defined on hello.fas, by typing (hello).
But all that can be accomplished also by typing
clisp -q << EOF
(load "hello.fas")
(hello)
EOF
in a shell script without having to invoke xterm first. A UNIX shell (csh, tcsh, sh, ksh or bash) on encountering a construct like that will take the text enclosed by the
<< EOF
...
EOF
construct and will pass it to the program, in this case clisp, as if it has been typed in by the user.
Long Matlab, GeneHunter, Xplor, etc., computations can and should be run like that too.
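The here-input mechanism can be tried out without LoadLeveler or any of the applications named above. The following is only a minimal sketch, assuming a POSIX shell; the function count_lines is a made-up stand-in for an application that reads its commands from standard input.

```shell
#!/bin/sh
# count_lines is a hypothetical stand-in for an application such as clisp:
# it reads its "commands" from standard input and reports how many it got.
count_lines () {
    n=0
    while IFS= read -r line
    do
        n=$((n + 1))
        echo "command $n: $line"
    done
    echo "received $n commands"
}

# Everything between "<< EOF" and the closing "EOF" is passed to the
# program on its standard input, as if it had been typed in by the user.
session=$(count_lines << EOF
(load "hello.fas")
(hello)
EOF
)
echo "$session"
```

Replacing count_lines with the real application, and the two lines of input with your own command file contents, gives the pattern used in the LoadLeveler script above.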
By a simple batch job I mean running just one program under LoadLeveler, a little like in the interactive example above, without any pre- or post-processing. Consequently the LoadLeveler script looks quite similar too.
Here is a simple hello world program written in Emacs Lisp:
(defun hello ()
  (princ "hello world\n"))
To execute this program in the Emacs batch mode under LoadLeveler I had first saved it on a file hello.el. Then I edited the following LoadLeveler script:
# @ initialdir = /home/qpsf/gustav/src/try
# @ executable = /opt/gnu/bin/emacs
# @ arguments = -batch -l hello.el -f hello
# @ output = emacs-batch.out
# @ error = emacs-batch.err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
and saved it on emacs-batch.ll.
The script was submitted to LoadLeveler with the command:
$ llsubmit emacs-batch.ll
When the job had finished its execution there was a file emacs-batch.out left in my ~/src/try subdirectory:
$ cat emacs-batch.out
hello world
$
An alternative way is to make the primary LoadLeveler job a shell script, and to execute emacs from within it. In that case you must not use the #@executable and the #@arguments directives. Instead, use the #@shell directive, to specify the shell of choice for your command file. Here is an example:
# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = emacs-batch.out
# @ error = emacs-batch.err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
emacs -batch -l hello.el -f hello
Often you may wish to perform some preliminary manipulations on your data files before passing them on to your application for execution, and after that's done, you may wish to do some clean-up work, perhaps making sure that various scratch files have been removed, etc.
The way to do that is to use the second approach presented in the previous section, i.e., to submit a shell script to LoadLeveler.
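Below is an example of a job like that. This job comprises three steps: it generates an Emacs Lisp file, llenv.el, from the LoadLeveler environment with env, grep, and awk; it runs emacs on that file in batch mode; and it removes the file.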
# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = $(job_name).out
# @ error = $(job_name).err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
env | grep LOADL | \
awk ' BEGIN {
{ printf "(defun llenv ()\n" }
{ printf " (princ \"LoadLeveler variables:\\n\") " }
}
{ printf " (princ \"\t%s\\n\")\n", $0 }
END { print ")"} ' > llenv.el
emacs -batch -l llenv.el -f llenv > llenv.out
rm llenv.el
Having saved this LoadLeveler script on env.ll I have submitted it with
$ llsubmit env.ll
and then viewed the results of the run as follows:
$ cat llenv.out
LoadLeveler variables:
LOADL_STEP_CLASS=half_hour
LOADL_STEP_ARGS=
LOADL_STEP_ID=s1n01.qpsf.edu.au.17417.0
LOADL_STARTD_PORT=9611
LOADL_STEP_NICE=0
LOADL_STEP_IN=/dev/null
LOADL_STEP_ERR=s1n01.qpsf.edu.au.17417.err
LOADL_STEP_GROUP=qpsf
LOADL_STEP_NAME=0
LOADL_STEP_ACCT=
LOADL_STEP_TYPE=SERIAL
LOADL_STEP_OWNER=gustav
LOADL_ACTIVE=1.2.1.11
LOADL_STEP_COMMAND=env.ll
LOADL_JOB_NAME=s1n01.qpsf.edu.au.17417
LOADL_STEP_OUT=s1n01.qpsf.edu.au.17417.out
LOADL_STEP_INITDIR=/home/qpsf/gustav/src/try
LOADL_PROCESSOR_LIST=s1n10.qpsf.edu.au
LOADLBATCH=yes
$
Observe that you can make use of all those LoadLeveler environmental variables in your LoadLeveler scripts. In particular, the variable LOADL_PROCESSOR_LIST is often used in manipulating parallel jobs. In this case it comprises only one machine name, because the job is SERIAL.
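To see what the awk filter writes to llenv.el, the pipeline can be tried outside LoadLeveler by exporting a couple of LOADL variables by hand. This is only a sketch under that assumption: the two values are copied from the sample output above, the awk program is a lightly tidied version of the one in the job script, and the result is kept in a shell variable rather than in a file.

```shell
#!/bin/sh
# Pretend to be inside a LoadLeveler job. These values are copied from the
# sample output above; normally LoadLeveler sets them itself.
LOADL_STEP_CLASS=half_hour
LOADL_STEP_TYPE=SERIAL
export LOADL_STEP_CLASS LOADL_STEP_TYPE

# Wrap every LOADL variable in a (princ ...) form inside a Lisp function.
generated=$(env | grep '^LOADL' | awk '
BEGIN { printf "(defun llenv ()\n"
        printf " (princ \"LoadLeveler variables:\\n\")\n" }
      { printf " (princ \"\\t%s\\n\")\n", $0 }
END   { print ")" }')

echo "$generated"
```

The output is an Emacs Lisp function definition with one princ form per variable, which is what emacs -batch later loads and calls.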
An important example of a sequential job which often has to do some pre and post-processing of data is Gaussian. There are several issues Gaussian users should pay attention to:
Here is an example of a sequential Gaussian job. For discussion go to ``Sequential Gaussian: Discussion'' towards the end of this section.
# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/Gaussian/test161
# @ output = $(job_name).out
# @ error = $(job_name).err
# @ job_type = serial
# @ class = two_hour_dedicated
# @ notification = always
# @ environment = COPY_ALL
# @ queue
export GAUSS_CHK_ROOT=test161
export GAUSS_OUTPUT=test161.out
export GAUSS_SCRDIR=/ptmp/gustav/test161
#
export g94root=/opt/gaussian
. $g94root/g94/bsd/g94.profile
#
if [ \! -d $GAUSS_SCRDIR ]
then
mkdir $GAUSS_SCRDIR
else
if [ -n "$GAUSS_CLEAN_SCRATCH" ]
then
echo cleaning $GAUSS_SCRDIR
(
cd $GAUSS_SCRDIR
rm -rf *
)
fi
fi
#
/bin/time g94 << EOF >> $GAUSS_OUTPUT
%chk=$GAUSS_CHK_ROOT
#p uhf/sto-3g test pop=full scf=conventional guess=mix
Gaussian Test Job 161 (Part 1):
Ketene, bent TS, UHF for later CAS-UNO
0 1
C
X 1 1.0
O 1 CO 2 TH
C 1 CC 2 TH 3 180.0
H 4 CH 1 HCC 2 DI
H 4 CH 1 HCC 2 -DI
CC 2.225
CO 1.178
TH 38.17
CH 1.12
HCC 102.9
DI 52.6
--Link1--
%chk=$GAUSS_CHK_ROOT
%nosave
#p cas(4,uno,4,qc)/sto-3g test scf=conventional pop=full guess=read
Gaussian Test Job 161 (Part 2):
Ketene, bent TS, sto 3g CASUNO
0 1
C
X 1 1.0
O 1 CO 2 TH
C 1 CC 2 TH 3 180.0
H 4 CH 1 HCC 2 DI
H 4 CH 1 HCC 2 -DI
CC 2.225
CO 1.178
TH 38.17
CH 1.12
HCC 102.9
DI 52.6
EOF
#
if grep "Normal termination of Gaussian 94" $GAUSS_OUTPUT
then
echo Test161 finished successfully.
echo -n Cleaning scratch directory ...
rm -rf $GAUSS_SCRDIR
echo done.
else
echo Gaussian run did not terminate normally
echo Checkfile is ${GAUSS_CHK_ROOT}.chk in $LOADL_STEP_INITDIR.
fi
A few words of explanations. The LoadLeveler directives in this job merely declare the shell, the initial directory, general output and error files, the job type, and the LoadLeveler class to submit the job to.
Before Gaussian itself is invoked I define four environmental variables related to this run:
. $g94root/g94/bsd/g94.profile
Gaussian environment files for csh and tcsh also can be found in $g94root/g94/bsd.
The next step checks if the Gaussian scratch directory already exists and creates it if it doesn't. If it does, I check if another environmental variable, GAUSS_CLEAN_SCRATCH, has been defined, and clean the scratch directory before running Gaussian.
Finally, Gaussian itself is invoked. The run is timed (with /bin/time) and Gaussian input is taken from the following text until the string EOF is encountered. It is here that I redirect Gaussian output to $GAUSS_OUTPUT. Observe that the shell variable $GAUSS_CHK_ROOT is used in the input. That variable will be replaced by its value before the input is passed to Gaussian.
When Gaussian exits, the script inspects the log written on $GAUSS_OUTPUT searching for the occurrence of a string "Normal termination of Gaussian 94". If the string is found, the scratch directory is removed. Otherwise the directory is left in place and a message about the location of the checkpoint file is written on $(job_name).out.
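The overall shape of that script (prepare a scratch directory, feed the application a here-document, check the log for a termination marker, clean up only on success) can be exercised without Gaussian. The sketch below assumes a POSIX shell with mktemp; fake_app, the marker text, and the file names are invented for illustration and are not part of Gaussian.

```shell
#!/bin/sh
# fake_app is a hypothetical stand-in for the real application: it copies
# its input to the log and finishes with a termination marker.
fake_app () {
    cat
    echo "Normal termination of fake_app"
}

# Scratch directory and log file names are made up for this sketch.
scratch=$(mktemp -d)
log="$scratch.log"
chk_root=test161

# Main step: the shell expands the here-document first, so the application
# sees "%chk=test161", not "%chk=$chk_root".
fake_app << EOF > "$log"
%chk=$chk_root
EOF

# Post-processing: remove the scratch directory only after a clean run.
if grep -q "Normal termination" "$log"
then
    rm -rf "$scratch"
    outcome="finished, scratch removed"
else
    outcome="failed, scratch kept in $scratch"
fi
echo "$outcome"

# Show what the application actually received, then tidy up the log.
input_seen=$(grep '^%chk=' "$log")
echo "application saw: $input_seen"
rm -f "$log"
```

Keeping the scratch directory after a failed run is deliberate: it preserves whatever is needed to diagnose or restart the computation.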
The postprocessing or preprocessing of data may sometimes be so involved that it should be performed as a separate LoadLeveler job, rather than combined with the main computational task.
The simplest way to proceed in such a situation would be to submit one LoadLeveler job, then wait for it to finish execution, and then to submit the second job. The submission of the second job could be performed from within the LoadLeveler script of the first job.
The following two scripts split the example from the ``How to submit a more complex sequential batch job'' section into two steps.
The first script, called env-1.ll, uses the commands env, grep, and awk to generate a data file, in this case Emacs Lisp code, which is saved on llenv.el (remember that programs are data, and, in particular, in case of Lisp, there is no semantic difference between programs and data: both are stored in the same data section of a Lisp process, and both can be modified dynamically during program execution). Once awk exits, the script checks if the data file is there (for example, an error may have occurred while executing awk). It also checks if the second LoadLeveler script can be found in its working directory. If both files are present, the second script is submitted with the llsubmit command.
# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = env-1.out
# @ error = env-1.err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
env | grep LOADL | \
awk ' BEGIN {
{ printf "(defun llenv ()\n" }
{ printf " (princ \"LoadLeveler variables:\\n\") " }
}
{ printf " (princ \"\t%s\\n\")\n", $0 }
END { print ")"} ' > llenv.el
if [ -f llenv.el -a -f env-2.ll ]
then
llsubmit env-2.ll
fi
The second script is called env-2.ll. First it checks if the file llenv.el exists. Even though we have already checked that within env-1.ll, here we do so again, because the scripts are separate, and there is always a possibility that env-2.ll may have been submitted without running env-1.ll first. If the file exists, we run emacs on it, if it doesn't, we flag an error and exit. The data file itself, llenv.el, is removed after emacs had its way with it.
# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = env-2.out
# @ error = env-2.err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
if [ -f llenv.el ]
then
    emacs -batch -l llenv.el -f llenv > llenv.out
else
    echo Error: env-2.ll job: llenv.el not found
    exit 1
fi
rm llenv.el
It is easy to restructure the two scripts above into one script, which first performs one task, then resubmits itself, and on the second invocation performs the second task.
In order to do that, the script must be able to find out on its own, whether its current instantiation is the first or the second one. If you have a creepy feeling now that we are getting close to talking about reincarnation, well, yes, you're quite right. That's exactly what we're talking about! How can a process know that it already lived before?
The answer is: by inspecting its environment and finding a particular variable set. The variable would be set during the first instantiation of the LoadLeveler job. It would not exist at all outside of those LoadLeveler jobs, i.e., the user should make sure that it is unset in the user's normal environment.
Here's the script:
# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = $(job_name).out
# @ error = $(job_name).err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
if [ -z "$ENV_SECOND_SUBMISSION" ]
then
env | grep LOADL | \
awk ' BEGIN {
{ printf "(defun llenv ()\n" }
{ printf " (princ \"LoadLeveler variables:\\n\") " }
}
{ printf " (princ \"\t%s\\n\")\n", $0 }
END { print ")"} ' > llenv.el
if [ $? -eq 0 ]
then
export ENV_SECOND_SUBMISSION="yes"
llsubmit $LOADL_STEP_COMMAND
else
echo Error: problem executing awk
exit 1
fi
else
emacs -batch -l llenv.el -f llenv > llenv.out
rm llenv.el
fi
The script works as follows. The first step is to check if the environmental variable ENV_SECOND_SUBMISSION has been set to something. If not, it means that this instantiation of the job has no ancestor. In this case the script calls env, grep, and awk to create the data file, llenv.el. After awk exits we inspect its exit status, $?, and only if it is 0, we define and export the new environmental variable, ENV_SECOND_SUBMISSION, and the script resubmits itself, because that is what $LOADL_STEP_COMMAND evaluates to. The variable ENV_SECOND_SUBMISSION will be visible in the second instantiation of the job, because of the LoadLeveler #@environment=COPY_ALL directive.
If the environmental variable ENV_SECOND_SUBMISSION is found to have been set to a non-zero string, the second clause of the if statement is executed. Within that clause we invoke emacs on the llenv.el file. The file is removed after emacs exits.
Observe that the #@output and #@error directives have been defined in terms of $(job_name) this time. Each instantiation of the script will have a different $(job_name), so that the output and error files for the second instantiation of the job will not overwrite output and error files written by the first instantiation of the job. That is important in case any execution problems arise.
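The logic of the self-resubmitting script can be traced on any machine without LoadLeveler. In the sketch below, fake_llsubmit is a hypothetical stand-in for llsubmit: instead of queueing anything, it simply runs the job body again in a subshell, which inherits the exported variable in much the same way as a job submitted with COPY_ALL would.

```shell
#!/bin/sh
# fake_llsubmit is a made-up replacement for llsubmit, for illustration
# only: it re-runs the job body in a subshell with the current environment.
fake_llsubmit () {
    ( job_body )
}

job_body () {
    if [ -z "$ENV_SECOND_SUBMISSION" ]
    then
        echo "first instantiation: preparing data"
        ENV_SECOND_SUBMISSION=yes
        export ENV_SECOND_SUBMISSION
        fake_llsubmit
    else
        echo "second instantiation: processing data"
    fi
}

# The marker must not be set in the user's normal environment.
unset ENV_SECOND_SUBMISSION
trace=$(job_body)
echo "$trace"
```

The first pass finds the marker unset, does its work, sets the marker and resubmits; the second pass finds the marker and takes the other branch.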
A similar mechanism can be used in order to construct an automatically resubmitting job, which will
If you want to execute a number of very simple tasks, in a sequence of LoadLeveler steps, tasks which do not involve much, if any, shell scripting, you may prefer to use LoadLeveler's own multiple job steps facility. That facility is a little bit tricky, and, in particular, you should not try to mix LoadLeveler steps with your own self-submitting shell scripts, because that may easily lead to confusion. In particular, remember, that if you do not use the LoadLeveler keyword #@executable, and thus, according to LoadLeveler's semantics, the LoadLeveler script itself becomes the executable, when the script is passed to, say, ksh for execution, all LoadLeveler keywords will be stripped, and the whole script will be executed in one go, even if the user has separated portions of the script with multiple #@queue directives.
Consider the following LoadLeveler script:
#
# Common definitions for all three steps
#
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = $(job_name).$(step_name).out
# @ error = $(job_name).$(step_name).err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ job_name = hello
#
# The first step: compile the program.
#
# @ step_name = compile
# @ executable = /opt/gnu/bin/gcc
# @ arguments = -o hello hello.c
# @ queue
#
# The second step: run the program if the compilation was successful.
#
# @ step_name = run
# @ dependency = compile == 0
# @ executable = /opt/gnu/bin/bash
# @ arguments = -c "exec hello"
# @ queue
#
# The third step: remove the binary if the run was successful.
#
# @ step_name = clean
# @ dependency = run == 0
# @ executable = /usr/bin/rm
# @ arguments = -e hello
# @ queue
When this script is submitted to LoadLeveler, three jobs will be placed in the queue. Initially two of those jobs will wait until the first job finishes execution. Then the second job will commence execution and the third will continue waiting. Finally, the third job will run. I should add that the second and the third jobs will run only if their direct ancestor has exited without any problems, leaving the exit status set to 0 behind.
The script is conceptually divided into four chunks.
The first chunk is a preamble with definitions common to all three job steps.
The second chunk describes the first step: it invokes the GNU C compiler and compiles a C program hello.c generating a binary hello, if the compilation has been successful.
The third chunk describes the second step: it will run only if the first step has left exit status 0 behind. That's what the directive
# @ dependency = compile == 0
is about. Observe a small complication. Instead of defining
# @ executable = hello
I have defined
# @ executable = /opt/gnu/bin/bash
# @ arguments = -c "exec hello"
The reason for this is that when the script is originally submitted to LoadLeveler, the file hello doesn't exist yet. So if I defined here #@executable = hello LoadLeveler would refuse the job and flag an error. All executables specified with the #@executable keyword must exist at the time the LoadLeveler script is submitted. The remedy is to specify my login shell as the executable instead, and then substitute (with exec) the shell with the binary produced in the first step.
The fourth chunk describes the third step: it will run only if the second step has left exit status 0 behind. That's what the directive
# @ dependency = run == 0
achieves. It is your responsibility, as a programmer, to ensure that this is indeed the case when your program exits cleanly.
This step removes the binary generated by the first step. The command rm is invoked with the -e option which will leave a trace on the hello.clean.err file:
rm: Removing hello
Can the same be achieved with shell scripting? Although I have warned you about possible pitfalls when mixing scripting and LoadLeveler steps, it is OK to do so, as long as your script does not attempt to resubmit itself. You might even consider the latter, but in that case you must carefully scrutinise the logic of both the shell script and the overlaying LoadLeveler script. Things may become easily convoluted, but not necessarily incorrect! Also, you should remember that the first occurrence of the keyword #@executable will override the shell script for all consecutive steps. If a shell script is present in the LoadLeveler command file, all steps defined before the first occurrence of the keyword #@executable will see the same script. Consequently, the script itself must be able to recognise which particular step is being executed during its instantiation and differentiate its actions accordingly. That information can be obtained from the environmental variable LOADL_STEP_NAME.
Here is an example of a 3-step LoadLeveler job, equivalent to the one discussed above, in which the actions are specified entirely using a shell script rather than three different #@executables.
# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = $(job_name).$(step_name).out
# @ error = $(job_name).$(step_name).err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ job_name = hello
#
# @ step_name = compile
# @ queue
#
# @ step_name = run
# @ dependency = compile == 0
# @ queue
#
# @ step_name = clean
# @ dependency = run == 0
# @ queue
#
echo step: $LOADL_STEP_NAME
case $LOADL_STEP_NAME in
compile )
gcc -v -o hello hello.c 2>&1 ;;
run )
hello ;;
clean )
rm -e hello 2>&1 ;;
esac
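Before submitting a multi-step script of this kind it can be helpful to dry-run the step dispatch locally. The sketch below is an assumption-laden illustration, not LoadLeveler behaviour: it sets LOADL_STEP_NAME by hand, replaces the real commands with placeholders, and mimics the dependency directives by stopping at the first step that returns a non-zero exit status.

```shell
#!/bin/sh
# run_step is a hypothetical stand-in for the body of the job script; the
# real commands (gcc, hello, rm) are replaced by harmless placeholders.
run_step () {
    case $LOADL_STEP_NAME in
        compile ) echo "compiling" ;;
        run )     echo "running" ;;
        clean )   echo "cleaning" ;;
        * )       echo "unknown step: $LOADL_STEP_NAME" >&2; return 1 ;;
    esac
}

# Emulate "dependency = <previous step> == 0": each step runs only if the
# one before it returned exit status 0.
completed=""
for LOADL_STEP_NAME in compile run clean
do
    export LOADL_STEP_NAME
    run_step || break
    completed="$completed $LOADL_STEP_NAME"
done
echo "completed steps:$completed"
```

Under LoadLeveler itself the loop does not exist: each step is a separate instantiation of the same script, and the scheduler decides whether the next one may start.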
In the remaining part of this article I will briefly discuss how to run parallel jobs under LoadLeveler.
Because the main purpose of our computer system is to support parallel jobs, especially MPI and HPF jobs, this issue is also discussed in separate articles:
From the point of view of LoadLeveler there are three basic classes of parallel jobs: POE jobs, PVM3.3 jobs, and other parallel jobs. Of these LoadLeveler knows best how to run POE jobs - these can be MPI, MPL, HPF, PVMe, and Linda jobs. They are all based on a concept of a static processor pool, i.e., processors are assigned to the job at the beginning of its execution, and no additional processors can be grabbed by the job while it runs. Consequently, certain PVM concepts such as dynamic allocation and de-allocation of parallel machines are not supported. This applies also to the next class of parallel jobs that LoadLeveler knows about: PVM3.3 jobs. LoadLeveler knows how to start PVM3.3 daemons on allocated machines, and how to invoke a PVM3.3 application.
LoadLeveler has no idea how to run other parallel jobs, e.g., network-Linda jobs (the parallel Gaussian falls in this category), ISIS jobs, LAM jobs, etc. But sometimes LoadLeveler can be fooled into thinking that these are either POE or PVM3.3 jobs, and, at the very least it will produce a list of allocated nodes, which user programs or daemons can then distribute themselves over.
POE jobs are easiest to run under LoadLeveler. There is always the same #@executable=/usr/bin/poe involved, regardless of whether the job is an MPI, MPL, HPF, or Linda job. Only for PVMe jobs the executable is different: #@executable=/usr/lpp/pvme/bin/pvmd3e. The only difference between, say, an MPI and an HPF job is the compiler, that's been used to produce the POE binary.
At the very least POE jobs on our system should be specified by the following LoadLeveler keywords:
Here is an example of an MPI version of "hello world". The program itself looks as follows:
#include <stdio.h>
#include <mpi.h>
main(argc, argv)
int argc;
char *argv[];
{
char name[BUFSIZ];
int length;
MPI_Init(&argc, &argv);
MPI_Get_processor_name(name, &length);
printf("%s: hello world\n", name);
MPI_Finalize();
}
Compile it with the command
$ mpcc mpi-hello.c -o mpi-hello
and run by submitting the following LoadLeveler script:
# @ initialdir = /home/qpsf/gustav/src/try
# @ job_type = parallel
# @ environment = COPY_ALL; MP_EUILIB=ip
# @ requirements = (Adapter == "hps_ip")
# @ min_processors = 4
# @ max_processors = 8
# @ output = mpi-hello.out
# @ error = mpi-hello.err
# @ executable = /usr/bin/poe
# @ arguments = mpi-hello
# @ class = half_hour
# @ notification = always
# @ queue
This is what the file mpi-hello.out may look like after the run is finished:
s1n05: hello world
s1n04: hello world
s1n10: hello world
s1n09: hello world
s1n08: hello world
s1n06: hello world
s1n07: hello world
s1n03: hello world
The file mpi-hello.err will contain a message from poe:
WARNING: 0031-408 8 nodes allocated by LoadLeveler, continuing...
It is possible to increase markedly the level of poe verbosity by adding the environmental variable MP_INFOLEVEL to the list specified by the #@environment keyword:
# @ environment=COPY_ALL; MP_EUILIB=ip; MP_INFOLEVEL=6
Numerous diagnostic messages will be then written on mpi-hello.err (about 20kB in case of this little program, sic!).
If you need to perform certain manipulations before and after running the main executable you can use scripting the same way as has already been discussed for sequential programs. For example, you could run the MPI "hello world" example as follows:
# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ job_type = parallel
# @ environment = COPY_ALL; MP_EUILIB=ip; MP_INFOLEVEL=3
# @ requirements = (Adapter == "hps_ip")
# @ min_processors = 4
# @ max_processors = 8
# @ output = mpi-hello.out
# @ error = mpi-hello.err
# @ class = half_hour
# @ notification = always
# @ queue
poe mpi-hello
The next example shows how you can have even more control over the way your POE job is run under LoadLeveler:
# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ job_type = parallel
# @ environment = COPY_ALL;
# @ requirements = (Adapter == "hps_ip")
# @ min_processors = 4
# @ max_processors = 8
# @ output = mpi-hello.out
# @ error = mpi-hello.err
# @ class = half_hour
# @ notification = always
# @ queue
> host.list.$LOADL_STEP_ID
NPROC=0
for node in $LOADL_PROCESSOR_LIST
do
    echo $node >> host.list.$LOADL_STEP_ID
    NPROC=`expr $NPROC + 1`
done
#
export MP_HOSTFILE=host.list.$LOADL_STEP_ID
export MP_PROCS=$NPROC
export MP_EUILIB=ip
export MP_EUIDEVICE=css0
export MP_INFOLEVEL=3
#
poe mpi-hello
#
rm $MP_HOSTFILE
This time I construct dynamically the POE host file and pass it to POE via the MP_HOSTFILE environmental variable. At the same time I count the number of allocated nodes. I have requested that number to be between 4 and 8, but the exact number will be known only when the job starts. That number is passed to POE via the MP_PROCS environmental variable.
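The host-file construction can be tested on its own before it is embedded in a job. The sketch below assumes a POSIX shell with mktemp. When LOADL_PROCESSOR_LIST is not set, it falls back to a sample list of node names taken from the example output in this article, and it uses shell arithmetic in place of expr. No poe is started.

```shell
#!/bin/sh
# Outside LoadLeveler the variable is unset, so fall back to a sample list;
# the node names are taken from the example output in this article.
: "${LOADL_PROCESSOR_LIST:=s1n04 s1n05 s1n07 s1n10}"

hostfile=$(mktemp)
nproc=0
for node in $LOADL_PROCESSOR_LIST
do
    echo "$node" >> "$hostfile"
    nproc=$((nproc + 1))     # shell arithmetic instead of expr
done

MP_HOSTFILE=$hostfile
MP_PROCS=$nproc
export MP_HOSTFILE MP_PROCS

echo "MP_PROCS=$MP_PROCS"
cat "$MP_HOSTFILE"

# In a real job poe would run here, before the host file is removed.
rm -f "$MP_HOSTFILE"
```

Using a temporary file name serves the same purpose as the $LOADL_STEP_ID suffix in the job script: two jobs running in the same directory do not overwrite each other's host lists.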
It is instructive to have a look at the mpi-hello.err file produced by the run. The file may begin with something like:
INFO: DEBUG_LEVEL changed from 0 to 1
D1<L1>: Open of file host.list.s1n01.qpsf.edu.au.17791.0 successful
D1<L1>: mp_euilib = ip
D1<L1>: node allocation strategy = 1
INFO: 0031-119 Host s1n04.qpsf.edu.au allocated for task 0
INFO: 0031-119 Host s1n05.qpsf.edu.au allocated for task 1
INFO: 0031-119 Host s1n07.qpsf.edu.au allocated for task 2
INFO: 0031-119 Host s1n10.qpsf.edu.au allocated for task 3
INFO: 0031-119 Host s1n03.qpsf.edu.au allocated for task 4
INFO: 0031-119 Host s1n09.qpsf.edu.au allocated for task 5
INFO: 0031-119 Host s1n08.qpsf.edu.au allocated for task 6
INFO: 0031-119 Host s1n06.qpsf.edu.au allocated for task 7
which shows that host names were obtained from the dynamically generated host file, and it also shows which task runs on which host.
Throughout the file you will find 8 occurrences of a line:
INFO: 0031-724 Executing program: <mpi-hello>
which shows that all POE tasks successfully located and loaded the executable mpi-hello. You will also find there lines such as
D1<L1>: init_data for task 1: <203.2.136.7:4320>
D1<L1>: init_data for task 2: <203.2.136.9:4688>
D1<L1>: init_data for task 4: <203.2.136.5:4940>
D1<L1>: init_data for task 7: <203.2.136.8:4066>
The IP numbers, which appear in the brackets, e.g., 203.2.136.7 correspond to the HPS interfaces, which means that all communication takes place over the switch, even though the names specified in the host file correspond to ethernet interfaces, sic!
Then you can see lines such as
INFO: 0031-656 I/O file STDOUT closed by task 6
INFO: 0031-656 I/O file STDOUT closed by task 2
INFO: 0031-656 I/O file STDOUT closed by task 4
...
and
INFO: 0031-251 task 5 exited: rc=0
INFO: 0031-251 task 6 exited: rc=0
INFO: 0031-251 task 7 exited: rc=0
INFO: 0031-251 task 1 exited: rc=0
...
The file ends with
D1<L1>: All remote tasks have exited: maxx_errcode = 0
INFO: 0031-639 Exit status from pm_respond = 0
D1<L1>: Maximum return code from user = 0
These lines show the return codes of the tasks. If something goes wrong with any of them, you can locate the offending process by inspecting the mpi-hello.err file.
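Searching a long error file by eye is tedious, so the check can be left to awk. The following is a minimal sketch, assuming the "task N exited: rc=M" line format shown above; the sample log is written by the script itself, and the rc=1 entry for task 3 is invented here so that there is something to find.

```shell
#!/bin/sh
# Sample lines in the format shown above; the rc=1 entry for task 3 is
# made up for this illustration.
log=$(mktemp)
cat > "$log" << 'EOF'
INFO: 0031-251 task 5 exited: rc=0
INFO: 0031-251 task 3 exited: rc=1
INFO: 0031-251 task 6 exited: rc=0
EOF

# Print the number of every task whose return code is not 0.
failed=$(awk '/task [0-9]+ exited: rc=/ {
    split($NF, rc, "=")
    if (rc[2] != 0) print $4
}' "$log")
rm -f "$log"

echo "tasks with non-zero return code: ${failed:-none}"
```

Pointing the awk command at mpi-hello.err instead of the temporary file applies the same check to a real run, and the task numbers it prints can be matched against the "allocated for task" lines to find the host.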
High Performance Fortran jobs are POE jobs, and as such they are run under LoadLeveler or interactively in the same way as POE/MPI jobs.
Article ``High Performance Fortran'' discusses
In short, the program is called jacobi.f. It is compiled like that:
$ xlhpf90 -o jacobi jacobi.fand the LoadLeveler script used to run it looks as follows:
# @ initialdir = /home/qpsf/gustav/HPF/aix # @ job_type = parallel # @ environment = COPY_ALL; MP_EUILIB=ip; MP_INFOLEVEL=6 # @ requirements = (Adapter == "hps_ip") # @ min_processors = 4 # @ max_processors = 8 # @ output = jacobi.out # @ error = jacobi.err # @ executable = /usr/bin/poe # @ arguments = jacobi # @ class = half_hour # @ notification = always # @ queueAs you see there is nothing really HPF specific here. Running the job with MP_INFOLEVEL=6 will show details of communication taking place between participating processes on the jacobi.err file. For example:
D3<L4>: Message type 21 from source 0
D3<L4>: Message type 44 from source 5
D3<L4>: Message type 15 from source 0
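Since an error file written at MP_INFOLEVEL=6 can get long, it may help to summarise it. The sketch below counts messages per source process; it assumes every line of interest has the "Message type T from source S" form shown above, and it uses a temporary file in place of jacobi.err.

```shell
#!/bin/sh
# Sketch only: a temporary file stands in for jacobi.err.
ERR=$(mktemp)
cat > "$ERR" <<'EOF'
D3<L4>: Message type 21 from source 0
D3<L4>: Message type 44 from source 5
D3<L4>: Message type 15 from source 0
EOF

# The last field is the source task; tally the messages for each source.
counts=$(awk '/Message type .* from source/ { n[$NF]++ }
              END { for (s in n) print s, n[s] }' "$ERR" | sort -n)

echo "source count"
echo "$counts"
rm -f "$ERR"
```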
You will find Linda examples in the directory /opt/linda/common. The subdirectory cl-examples contains C and C++ examples, and the subdirectory fl-examples contains Fortran-77 examples.
There is a ping.cl example in cl-examples. That program creates two concurrent processes, which play a sort of a ping-pong game. If you copy that file to your $HOME directory, and if you add /opt/linda/sp2dm-4.1/bin to your command search PATH, you can compile and link that program with the command:
gustav@s1n01:~/cl-examples 367 $ clc -g -o ping ping.cl
clc (V4.0.1 Distributed Memory)
gustav@s1n01:~/cl-examples 368 $
To run that program, create the following file and save it as, say, ping.ll:
# @ initialdir = /home/qpsf/gustav/cl-examples
# @ job_type = parallel
# @ environment = COPY_ALL; MP_EUILIB=ip; MP_INFOLEVEL=0
# @ requirements = (Adapter == "hps_ip")
# @ min_processors = 4
# @ max_processors = 8
# @ output = ping.out
# @ error = ping.err
# @ executable = /usr/bin/poe
# @ arguments = ./ping 1000
# @ class = half_hour
# @ notification = always
# @ queue
As you see, as in our HPF example, there is nothing here that would be Linda-specific. As far as LoadLeveler, AIX, and SP are concerned, this is just a POE job.
Submit that file to LoadLeveler with the command
gustav@s1n01:~/cl-examples 368 $ llsubmit ping.ll
submit: The job "s1n01.17931" has been submitted.
gustav@s1n01:~/cl-examples 369 $
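If you submit jobs from a script of your own, you may want to capture the job identifier that llsubmit reports. The sketch below extracts it from a message of the form shown above. The message is hard-coded here so that the example can be run anywhere; it assumes the wording of the llsubmit message does not change.

```shell
#!/bin/sh
# Sketch only: in a real script the message would come from
#   msg=$(llsubmit ping.ll)
msg='submit: The job "s1n01.17931" has been submitted.'

# Keep only the text between the double quotes.
jobid=$(printf '%s\n' "$msg" |
        sed -n 's/.*The job "\([^"]*\)" has been submitted.*/\1/p')

echo "job id: $jobid"
```

The identifier can then be kept for later reference, for example when checking the queue for that job.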
After the job exits, you will see the following in ping.out:
sp2dm Linda Runtime System: version Tue Jun 11 13:24:20 EDT 1996
Note              Since Start   Since Last
timer started.    0.000         0.000
evals done.       0.003         0.003
done.             1.319         1.316
Here is another example, which requires a more elaborate LoadLeveler script. The program below is a sort of Linda distributed "Hello world", similar to our MPI example:
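#include <stdio.h>
#include <stdlib.h>   /* atoi */
#include <unistd.h>   /* gethostname */

/* Master: spawns the workers, then collects one "done" tuple from each. */
int real_main(argc, argv)
int argc;
char *argv[];
{
    int nworker, j, hello();
    char name[BUFSIZ];

    if (argc < 2) {
        fprintf(stderr, "usage: hello number-of-workers\n");
        return(1);
    }
    nworker = atoi(argv[1]);
    gethostname(name, BUFSIZ);
    printf("%s: the master process.\n", name);
    printf("%s: spawning function \"hello\" on %d worker processes.\n",
           name, nworker);
    /* Each eval starts hello(j) as a separate worker process. */
    for (j = 0; j < nworker; j++) eval("worker", hello(j));
    printf("%s: receiving %d \"dones\" from worker processes.\n",
           name, nworker);
    /* Each in blocks until a "done" tuple is available. */
    for (j = 0; j < nworker; j++) in("done");
    printf("%s: finished.\n", name);
    return(0);
}

/* Worker: reports the host it runs on and deposits a "done" tuple. */
int hello(i)
int i;
{
    char name[BUFSIZ];

    gethostname(name, BUFSIZ);
    printf("\tHello from number %d running on %s.\n", i, name);
    out("done");
    return(0);
}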
When this program is invoked it takes the number of workers from the command line. If we specify different values for #@min_processors and #@max_processors, we will not know how many processors we are going to get until the LoadLeveler script actually runs. So the trick is to inform the Linda program that it should spawn, say, 7 workers if LoadLeveler allocates 8 processors to this run (the 8th process is the master process). Below is a script which accomplishes that. Observe that this time we manipulate the POE environment ourselves, taking from LoadLeveler only a list of allocated nodes ($LOADL_PROCESSOR_LIST) and a step id number ($LOADL_STEP_ID). The latter is used to generate a unique file name for the host list file.
# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/cl-examples
# @ job_type = parallel
# @ environment = COPY_ALL;
# @ requirements = (Adapter == "hps_ip")
# @ min_processors = 4
# @ max_processors = 8
# @ output = hello.out
# @ error = hello.err
# @ class = half_hour
# @ notification = always
# @ queue
> host.list.$LOADL_STEP_ID
NPROC=0
for node in $LOADL_PROCESSOR_LIST
do
   echo $node >> host.list.$LOADL_STEP_ID
   NPROC=`expr $NPROC + 1`
done
#
export MP_HOSTFILE=host.list.$LOADL_STEP_ID
export MP_PROCS=$NPROC
export MP_EUILIB=ip
export MP_EUIDEVICE=css0
export MP_INFOLEVEL=3
#
poe ./hello `expr $NPROC - 1`
#
rm $MP_HOSTFILE
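The host list logic of such a script can be tried out away from LoadLeveler by setting the two LoadLeveler variables by hand. The sketch below does that with made-up values for $LOADL_PROCESSOR_LIST and $LOADL_STEP_ID and a temporary directory in place of the job's initial directory. It only prints the poe command instead of running it.

```shell
#!/bin/sh
# Sketch only: stand-in values; under LoadLeveler these are set for you.
LOADL_PROCESSOR_LIST="s1n03 s1n04 s1n05 s1n06"
LOADL_STEP_ID="test.0.0"

workdir=$(mktemp -d)
hostfile="$workdir/host.list.$LOADL_STEP_ID"

# Write one node name per line and count the nodes as we go.
: > "$hostfile"
NPROC=0
for node in $LOADL_PROCESSOR_LIST
do
   echo "$node" >> "$hostfile"
   NPROC=$((NPROC + 1))
done

# One process is the master, the rest are workers.
NWORKER=$((NPROC - 1))
lines=$(wc -l < "$hostfile" | tr -d ' ')

echo "host file has $lines lines"
echo "would run: poe ./hello $NWORKER"
rm -rf "$workdir"
```

The arithmetic expansion $((...)) used here is a substitute for the expr calls; both give the same result in a POSIX shell.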
The hello.out file generated by the hello job may look something like this:
gustav@s1n01:~/cl-examples 482 $ cat hello.out
sp2dm Linda Runtime System: version Tue Jun 11 13:24:20 EDT 1996
s1n07: the master process.
Hello from number 0 running on s1n03.
Hello from number 1 running on s1n10.
Hello from number 2 running on s1n06.
Hello from number 3 running on s1n08.
Hello from number 5 running on s1n04.
Hello from number 6 running on s1n09.
Hello from number 4 running on s1n05.
s1n07: spawning function "hello" on 7 worker processes.
s1n07: receiving 7 "dones" from worker processes.
s1n07: finished.
gustav@s1n01:~/cl-examples 483 $
Observe that messages arriving from various processes may not necessarily be registered by the system and displayed in the order in which you think they should have arrived. For example, the message about spawning function "hello" was produced chronologically before the workers had a chance to generate their "Hello from number" messages. Yet in the listing above it appears after those messages.
PVMe is an implementation of PVM designed for use on an IBM SP system with the High Performance Switch. Although compatible with PVM, PVMe has a different internal structure, not least because it runs on a homogeneous platform, and because it makes use of special primitives for task synchronisation.
There is an example PVM benchmark code in /usr/lpp/pvme/sample. In order to compile and run that benchmark, do as follows. First, create the following directories in your $HOME:
PARAMETER(IPROC=1)
with, say,
PARAMETER(IPROC=5)
(this means that PVMe will attempt to spawn 5
processes when this program is run). Then type:
gustav@s1n01:~/pvm3/src/sample 823 $ make
xlf -c -O hostbenc.f
** hostbenc === End of Compilation 1 ===
1501-510 Compilation successful for file hostbenc.f.
xlf -L/usr/lpp/ssp/css/libus -lmpci hostbenc.o -lpvm3 -lfpvm3 \
-o /home/qpsf/gustav/pvm3/bin/RS6K/hostbenc \
-bI:/usr/lib/pvm3e.exp
xlf -c -O nodebenc.f
** nodebenc === End of Compilation 1 ===
1501-510 Compilation successful for file nodebenc.f.
xlf -L/usr/lpp/ssp/css/libus -lmpci nodebenc.o -lpvm3 -lfpvm3 \
-o /home/qpsf/gustav/pvm3/bin/RS6K/nodebenc \
-bI:/usr/lib/pvm3e.exp
gustav@s1n01:~/pvm3/src/sample 824 $
Because the binaries have been linked with the libus libraries instead of libip libraries, they will have to be run in the user mode, i.e., only this one parallel job will be allowed to run on the nodes allocated by LoadLeveler. So, we will have to run it on pool 2, submitting the job to a dedicated class.
My LoadLeveler script for this job looks as follows:
# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/pvm3/bin/RS6K
# @ job_type = parallel
# @ environment = COPY_ALL;
# @ requirements = (Adapter == "hps_user")
# @ min_processors = 6
# @ max_processors = 6
# @ output = hostbenc.out
# @ error = hostbenc.err
# @ class = half_hour_dedicated
# @ notification = always
# @ queue
ln -s /usr/bin/rsh $HOME/bin/rsh
hash -r
/usr/lpp/pvme/bin/pvmd3e \
   -exec /home/qpsf/gustav/pvm3/bin/RS6K/hostbenc << EOF
40000
5
3
EOF
rm $HOME/bin/rsh
Observe that since I will be spawning 5 PVMe processes within this job, I need 6 processors altogether. The 6th process is the master. Also observe that before invoking pvmd3e I have linked /usr/bin/rsh into my $HOME/bin. That is because I normally use Kerberised rsh, but LoadLeveler will not pass on my Kerberos credentials. PVMe uses rsh to spawn processes: if I didn't replace Kerberised rsh with /usr/bin/rsh, that procedure would generate errors and unnecessary delays. After the PVMe job exits I remove the link, which is no longer needed.
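One weakness of creating the link at the start of the job and removing it at the end is that the link stays behind if the script is interrupted in between. A possible refinement, not part of the script above, is to remove the link from an exit trap. The sketch below shows the idea with a temporary directory standing in for $HOME/bin and a deliberately failing command standing in for pvmd3e.

```shell
#!/bin/sh
# Sketch only: bindir is a stand-in for $HOME/bin.
bindir=$(mktemp -d)

status=0
(
   # Remove the link whenever this subshell exits, normally or not.
   trap 'rm -f "$bindir/rsh"' EXIT
   ln -s /usr/bin/rsh "$bindir/rsh"
   # ... the pvmd3e invocation would go here ...
   false        # simulate a failing step
) || status=$?

# Check whether the link survived the failure.
if [ -e "$bindir/rsh" ] || [ -L "$bindir/rsh" ]; then
   left=yes
else
   left=no
fi

echo "exit status: $status, link left behind: $left"
rmdir "$bindir"
```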
When the job has finished, the file hostbenc.out will contain the benchmark results, which may look like this:
...
Local task: TIME: 21017.9686546325684 MICROSECONDS .
Local task: BANDWIDTH 7.61253393347188378 MB/s.
Local task: BANDWIDTH PUT 12782.6407296316975 MB/s.
Local task: BANDWIDTH SND 22.3030837436868055 MB/s.
...
As to the correctness of these results, I suspect a bug in the way the first BANDWIDTH is calculated (there is an underlying assumption that there are only two communicating processes, but we have spawned five!), but that is a different matter altogether. The difference between PUT and SND bandwidths illustrates how slow network communication is, even over the High Performance Switch, in comparison with memory-to-memory copy. You should always keep that in mind when developing distributed computer programs.
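To put a number on the gap between the PUT and SND figures, the sketch below pulls the two bandwidths out of lines of the form shown above and prints their ratio. It assumes the "Local task: BANDWIDTH PUT ..." and "Local task: BANDWIDTH SND ..." layout and reads a temporary copy of the two lines rather than hostbenc.out itself.

```shell
#!/bin/sh
# Sketch only: a temporary file stands in for hostbenc.out.
OUT=$(mktemp)
cat > "$OUT" <<'EOF'
Local task: BANDWIDTH PUT 12782.6407296316975 MB/s.
Local task: BANDWIDTH SND 22.3030837436868055 MB/s.
EOF

# Field 4 names the measurement, field 5 is the bandwidth in MB/s.
ratio=$(awk '$3 == "BANDWIDTH" && $4 == "PUT" { put = $5 }
             $3 == "BANDWIDTH" && $4 == "SND" { snd = $5 }
             END { if (snd > 0) printf "%.0f\n", put / snd }' "$OUT")

echo "PUT/SND bandwidth ratio: about $ratio"
rm -f "$OUT"
```

With the figures quoted above, the memory-to-memory copy comes out several hundred times faster than the send over the switch.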
Parallel Gaussian is somewhat similar to PVMe, because it is implemented on top of network-Linda, as opposed to the POE-Linda discussed above. Network-Linda processes are spawned by /usr/bin/rsh. There is an additional complication: the list of allocated nodes returned by LoadLeveler in $LOADL_PROCESSOR_LIST corresponds to ethernet interfaces. Whereas POE and PVMe translate those names automatically to HPS interfaces, network-Linda does not. So we have to do the translation ourselves before starting the parallel Gaussian process.
The article ``The Parallel Gaussian'' discusses in detail how to run parallel Gaussian jobs under LoadLeveler on our system. Please refer to that article for more information.