LoadLeveler Hints and Recipes

by
Zdzislaw Meglicki

22nd January/6th February 1997



How to submit an interactive job

Sometimes you may wish to interact with your production jobs by conversing with them via a command line or an X11 interface. A simple way to arrange for this mode of execution is to ask LoadLeveler to give you an xterm window running on a node serving a given class. Once the window appears on your X11 display, you can enter whatever commands you wish and run your application interactively, the same way you would run it on one of the front-end nodes (i.e., s1n01 or s1n02).

The difference is that the node given to you by LoadLeveler will almost always be less loaded than the front-end nodes. If you request the xterm to run on a node of pool 2, you will have the whole node to yourself. On nodes of pool 1 you may have to share a node with up to 3 other jobs. Another difference is that interactive jobs running under LoadLeveler will be accounted for. You may not find that prospect particularly exciting, but from our point of view it matters a great deal: so much so that we are prepared to kill, without warning or mercy, any production jobs found running on the front-end nodes - interactive or not!

In order to run xterm under LoadLeveler, first prepare a LoadLeveler script which should look as follows:

# @ output = $(job_name).out
# @ error = $(job_name).err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ executable = /usr/bin/X11/xterm
# @ arguments = -ls -sb -sl 300 -n $(job_name) -T $(job_name)
# @ queue
and save it as, say, xterm.ll.

Before submitting the script to LoadLeveler, ensure that X-windows programs running on the SP will be allowed access to your X11 display. The easiest way to do that is to add an appropriate authorisation entry for your display to the .Xauthority file on the SP with a command such as

$ xauth add nazgul.qpsf.edu.au:0  MIT-MAGIC-COOKIE-1  62dd45bc706a4415357c7973386d7112
Of course, you should replace the name of the display (nazgul.qpsf.edu.au:0), the protocol name (MIT-MAGIC-COOKIE-1), and the key data (62dd45bc706a4415357c7973386d7112) with whatever is appropriate for your X11 display.
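If you are not sure what the correct entry is, one way to find it (assuming the xauth utility is also installed on the workstation which drives your display) is to list the entry on that workstation and copy it across:

$ xauth list nazgul.qpsf.edu.au:0
nazgul.qpsf.edu.au:0  MIT-MAGIC-COOKIE-1  62dd45bc706a4415357c7973386d7112

The line printed by xauth list, prefixed with the word add, is exactly what needs to be fed to xauth on the SP.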

Now define the display itself and submit the job:

$ export DISPLAY=nazgul.qpsf.edu.au:0.0
$ llsubmit xterm.ll
The value of the DISPLAY will be passed on to xterm, because I have used the
# @ environment = COPY_ALL
directive.

In the same way you can run any other interactive X11 application under LoadLeveler. For example, the following LoadLeveler script will run GNU Emacs:

# @ output = $(job_name).out
# @ error = $(job_name).err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ executable = /opt/gnu/bin/emacs
# @ queue
Remember that you must have the environmental variable DISPLAY bound to your X11 display for this to work.

Special notes for csh and tcsh users

If your login shell is either csh or tcsh, LoadLeveler may have problems cancelling your interactive jobs, because those jobs will be spawned through an intermediary csh, which won't go away. The way around this problem is to replace the script discussed above with the following:

# @ shell = /opt/gnu/bin/bash
# @ output = xterm.out
# @ error = xterm.err
# @ job_type = serial
# @ class = half_hour_dedicated
# @ notification = always
# @ environment = COPY_ALL
# @ queue
exec xterm -ls -sb -sl 300 -n `hostname` -T `hostname`
This script tells LoadLeveler explicitly to spawn the job by using exec from within an /opt/gnu/bin/bash process. The effect is that bash is replaced by xterm, so that this time llcancel will send its signal to the right process.

Another problem, which may occur occasionally and which will certainly occur if your login shell is csh or tcsh, is that in this context the X11 server managing your display will not receive the correct X11 authority information (protocol and key data) from xterm. This looks like a bug. We don't have a proper fix for it yet. In that case you will have to open the server to the world by issuing the command:

$ xhost +
before submitting the job with llsubmit. Once you get the window, you can close the server again with the command:
$ xhost -
From this point onwards, information in your .Xauthority file will be correctly passed to your X11 server by other X11 applications invoked from that window.

How to submit a not quite interactive job

There is a class of jobs which many users who lack UNIX skills think of as interactive, but which, in fact, aren't interactive at all. Applications which take input from a command line in some sort of application-dependent language fall into that category. Examples are Common Lisp, Smalltalk, Scheme, Matlab, Xplor, GeneHunter, etc.

Often a user has a command file prepared, which must be loaded into an application from an interactive session. Once the command file is loaded, the application begins executing a program which may take hours to complete. At that stage the user goes away, leaving an active telnet connection or an X11 window on the display. The window was needed only to load the file and perhaps issue some start-up commands for the computation.

Jobs like that are not interactive at all and they can and should be run under LoadLeveler without asking for an xterm window and forking an unnecessary login shell.

The way to execute such jobs is to use the here-input feature of UNIX shells:

$ my_command << EOF
   one_line_of_input
   another_line_of_input
EOF

Here is an example:

# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = hello-lisp.out
# @ error = hello-lisp.err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
clisp -q << EOF
(load "hello.fas")
(hello)
EOF
Here I have a Common Lisp program stored in the file hello.fas. Normally, in order to execute that program I would have to enter the Common Lisp environment with the command clisp. Then I would have to load the file which contains the program, and finally I would have to evaluate the function defined in hello.fas by typing (hello).

But all of that can also be accomplished by typing

clisp -q << EOF
(load "hello.fas")
(hello)
EOF
in a shell script, without having to invoke xterm first. A UNIX shell (csh, tcsh, sh, ksh or bash), on encountering a construct like this, will take the text enclosed by the
<< EOF
...
EOF
construct and pass it to the program, in this case clisp, as if it had been typed in by the user.
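One detail worth keeping in mind: the shell performs variable and command substitution on the text of the here-document before handing it to the program, unless the terminating word is quoted. A small sh/bash transcript illustrating this (the variable NAME is made up for the illustration):

$ NAME=world
$ cat << EOF
hello, $NAME
EOF
hello, world
$ cat << 'EOF'
hello, $NAME
EOF
hello, $NAME

The Gaussian example later in this article relies on exactly this substitution to insert the value of $GAUSS_CHK_ROOT into the Gaussian input.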

Long Matlab, GeneHunter, Xplor, etc., computations can and should be run like that too.
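For example, a long Matlab computation kept in a command file could be wrapped in a LoadLeveler script along the following lines. This is a sketch only: the command file my_commands.m and the initial directory are made up, and it is assumed that matlab reads its commands from standard input; check the exact invocation for your application.

# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = matlab-job.out
# @ error = matlab-job.err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
matlab << EOF
my_commands
quit
EOF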


How to submit a simple sequential batch job

By a simple batch job I mean running just one program under LoadLeveler, a little like in the interactive example above, without any pre- or post-processing. Consequently the LoadLeveler script looks quite similar too.

Here is a simple hello world program written in Emacs Lisp:

(defun hello ()
   (princ "hello world\n"))
To execute this program in Emacs batch mode under LoadLeveler, I first saved it in a file hello.el. Then I prepared the following LoadLeveler script:
# @ initialdir = /home/qpsf/gustav/src/try
# @ executable = /opt/gnu/bin/emacs
# @ arguments = -batch -l hello.el -f hello
# @ output = emacs-batch.out
# @ error = emacs-batch.err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
and saved it as emacs-batch.ll.

The script was submitted to LoadLeveler with the command:

$ llsubmit emacs-batch.ll

When the job had finished its execution there was a file emacs-batch.out left in my ~/src/try subdirectory:

$ cat emacs-batch.out
hello world
$

An alternative way is to make the primary LoadLeveler job a shell script and to execute emacs from within it. In that case you must not use the #@executable and #@arguments directives. Instead, use the #@shell directive to specify the shell of choice for your command file. Here is an example:

# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = emacs-batch.out
# @ error = emacs-batch.err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
emacs -batch -l hello.el -f hello 


How to submit a more complex sequential batch job

Often you may wish to perform some preliminary manipulations on your data files before passing them on to your application for execution, and after that's done, you may wish to do some clean-up work, perhaps making sure that various scratch files have been removed, etc.

The way to do that is to use the second approach presented in the previous section, i.e., to submit a shell script to LoadLeveler.

Below is an example of a job like that. This job comprises three steps:

  1. Obtain information about LoadLeveler variables and write it to a data file. The information is obtained by running env and grep, and the data file is constructed by running awk. The data file is written in the form of an Emacs Lisp program.
  2. Invoke the application on the data file, which has been generated dynamically in the previous step. The application, in this case, is emacs, and the data file is the Emacs Lisp program saved on llenv.el.
  3. Cleanup, i.e., remove the data file generated in step 1, since it is no longer needed.
Of course, there are simpler ways to list LoadLeveler environmental variables, but this toy example neatly illustrates the idea of a three-step procedure:
  1. data preprocessing/generation
  2. data manipulation by the main application
  3. data cleanup
Most simple LoadLeveler jobs have this structure.
# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = $(job_name).out
# @ error = $(job_name).err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
env | grep LOADL | \
awk ' BEGIN { 
         { printf "(defun llenv ()\n" } 
         { printf "   (princ \"LoadLeveler variables:\\n\") " }
      }
      { printf "   (princ \"\t%s\\n\")\n", $0 } 
      END { print ")"} ' > llenv.el
emacs -batch -l llenv.el -f llenv > llenv.out
rm llenv.el

Having saved this LoadLeveler script as env.ll, I submitted it with

$ llsubmit env.ll
and then viewed the results of the run as follows:
$ cat llenv.out
LoadLeveler variables:
        LOADL_STEP_CLASS=half_hour
        LOADL_STEP_ARGS=
        LOADL_STEP_ID=s1n01.qpsf.edu.au.17417.0
        LOADL_STARTD_PORT=9611
        LOADL_STEP_NICE=0
        LOADL_STEP_IN=/dev/null
        LOADL_STEP_ERR=s1n01.qpsf.edu.au.17417.err
        LOADL_STEP_GROUP=qpsf
        LOADL_STEP_NAME=0
        LOADL_STEP_ACCT=
        LOADL_STEP_TYPE=SERIAL
        LOADL_STEP_OWNER=gustav
        LOADL_ACTIVE=1.2.1.11
        LOADL_STEP_COMMAND=env.ll
        LOADL_JOB_NAME=s1n01.qpsf.edu.au.17417
        LOADL_STEP_OUT=s1n01.qpsf.edu.au.17417.out
        LOADL_STEP_INITDIR=/home/qpsf/gustav/src/try
        LOADL_PROCESSOR_LIST=s1n10.qpsf.edu.au 
        LOADLBATCH=yes
$ 

Observe that you can make use of all those LoadLeveler environmental variables in your LoadLeveler scripts. In particular, the variable LOADL_PROCESSOR_LIST is often used in manipulating parallel jobs. In this case it comprises only one machine name, because the job is SERIAL.
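For example, a script fragment like the following - the same idiom appears in the parallel examples further down - turns the list into a host file and counts the allocated nodes:

> host.list.$LOADL_STEP_ID
NPROC=0
for node in $LOADL_PROCESSOR_LIST
do
   echo $node >> host.list.$LOADL_STEP_ID
   NPROC=`expr $NPROC + 1`
done
echo $NPROC nodes allocated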

How to submit a sequential Gaussian job

An important example of a sequential job which often has to do some pre and post-processing of data is Gaussian. There are several issues Gaussian users should pay attention to:

  1. During execution Gaussian generates, and operates on, very large scratch files. Those scratch files must never be written to NFS (e.g., to your $HOME or to /dtmp), because that would slow the Gaussian job down to a crawl. Use either /tmp, which is a local file system attached directly to the node you run your Gaussian job on, or /ptmp, which is a Parallel I/O File System. Contact gustav@indiana.edu if you don't have your own directory on /ptmp.
  2. Gaussian has its own elaborate mechanism for checkpointing. Many, if not most, Gaussian jobs can be checkpointed. Some cannot. There is a special low-priority slow queue called gaussian for long Gaussian jobs which cannot be checkpointed. In general, you ought to understand that neither the hardware nor the logical architecture of our machine is particularly suitable for such jobs. Often you may be better off running a parallel Gaussian job in the offpeak_dedicated queue, perhaps over a weekend. A parallel job using effectively all 12 processors of pool 2 over a weekend can do up to 720 CPU hours of computing in one go. That is equivalent to 30 days (sic!) of sequential computing.

Here is an example of a sequential Gaussian job. For discussion go to ``Sequential Gaussian: Discussion'' towards the end of this section.

# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/Gaussian/test161
# @ output = $(job_name).out
# @ error = $(job_name).err
# @ job_type = serial
# @ class = two_hour_dedicated
# @ notification = always
# @ environment = COPY_ALL
# @ queue
export GAUSS_CHK_ROOT=test161
export GAUSS_OUTPUT=test161.out
export GAUSS_SCRDIR=/ptmp/gustav/test161
#
export g94root=/opt/gaussian
. $g94root/g94/bsd/g94.profile
#
if [ \! -d $GAUSS_SCRDIR ]
then
   mkdir $GAUSS_SCRDIR
else
   if [ -n "$GAUSS_CLEAN_SCRATCH" ]
   then
      echo cleaning $GAUSS_SCRDIR
      (
         cd $GAUSS_SCRDIR
         rm -rf *
      )
   fi
fi
#
/bin/time g94 << EOF >> $GAUSS_OUTPUT
%chk=$GAUSS_CHK_ROOT
#p uhf/sto-3g test pop=full scf=conventional guess=mix
     
Gaussian Test Job 161 (Part 1):
Ketene, bent TS, UHF for later CAS-UNO
     
0 1
C
X 1 1.0
O 1 CO 2 TH
C 1 CC 2 TH 3 180.0
H 4 CH 1 HCC 2 DI
H 4 CH 1 HCC 2 -DI
     
CC 2.225
CO 1.178
TH 38.17
CH 1.12
HCC 102.9
DI 52.6
     
--Link1--
%chk=$GAUSS_CHK_ROOT
%nosave
#p cas(4,uno,4,qc)/sto-3g test scf=conventional pop=full guess=read
     
Gaussian Test Job 161 (Part 2):
Ketene, bent TS, sto 3g CASUNO
     
0 1
C
X 1 1.0
O 1 CO 2 TH
C 1 CC 2 TH 3 180.0
H 4 CH 1 HCC 2 DI
H 4 CH 1 HCC 2 -DI
     
CC 2.225
CO 1.178
TH 38.17
CH 1.12
HCC 102.9
DI 52.6
     
EOF
#
if grep "Normal termination of Gaussian 94" $GAUSS_OUTPUT
then
   echo Test161 finished successfully.
   echo -n Cleaning scratch directory ... 
   rm -rf $GAUSS_SCRDIR
   echo done.
else
   echo Gaussian run did not terminate normally
   echo Checkfile is ${GAUSS_CHK_ROOT}.chk in $LOADL_STEP_INITDIR.
fi

Sequential Gaussian: Discussion

A few words of explanation. The LoadLeveler directives in this job merely declare the shell, the initial directory, the general output and error files, the job type, and the LoadLeveler class to submit the job to.

Before Gaussian itself is invoked I define four environmental variables related to this run:

GAUSS_CHK_ROOT
the checkpoint file name root: the checkpoint file itself will be called $GAUSS_CHK_ROOT.chk. Gaussian will write that file in the $LOADL_STEP_INITDIR directory, i.e., in /home/qpsf/gustav/Gaussian/test161 in this case.
GAUSS_OUTPUT
a file to capture Gaussian output; other stuff, i.e., various shell messages will still go to $(job_name).out as specified by the #@output directive.
GAUSS_SCRDIR
Gaussian scratch directory: that variable, unlike other variables defined in the script, will be looked up by Gaussian itself. Observe that the scratch directory here will be created (if need be) on /ptmp. On our system PIOFS is mounted on that directory.
g94root
Gaussian root directory: that is a directory on the system where Gaussian binaries and auxiliary files live. On our system that directory is /opt/gaussian.
After the variables are defined I source the Gaussian environment with the command
. $g94root/g94/bsd/g94.profile
Gaussian environment files for csh and tcsh can also be found in $g94root/g94/bsd.

The next step checks if the Gaussian scratch directory already exists and creates it if it doesn't. If it does exist, I check whether another environmental variable, GAUSS_CLEAN_SCRATCH, has been defined, and, if so, clean the scratch directory before running Gaussian.
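Because the job is submitted with #@environment=COPY_ALL, it is enough to set that variable in your login session before submitting the job. For example (the script name test161.ll is just an assumed name for the script shown above):

$ export GAUSS_CLEAN_SCRATCH=yes
$ llsubmit test161.ll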

Finally, Gaussian itself is invoked. The run is timed (with /bin/time), and the Gaussian input is taken from the text that follows, up to the line containing the string EOF. It is here that I redirect the Gaussian output to $GAUSS_OUTPUT. Observe that the shell variable $GAUSS_CHK_ROOT is used in the input. That variable will be replaced by its value before the input is passed to Gaussian.

When Gaussian exits, the script inspects the log written on $GAUSS_OUTPUT searching for the occurrence of a string "Normal termination of Gaussian 94". If the string is found, the scratch directory is removed. Otherwise the directory is left in place and a message about the location of the checkpoint file is written on $(job_name).out.


How to submit a number of dependent jobs

The postprocessing or preprocessing of data may sometimes be so involved that it should be performed as a separate LoadLeveler job, rather than combined with the main computational task.

The simplest way to proceed in such a situation is to submit one LoadLeveler job, wait for it to finish execution, and then submit the second job. The submission of the second job can be performed from within the LoadLeveler script of the first job.

The following two scripts split the example from the ``How to submit a more complex sequential batch job'' section into two steps.

The first script, called env-1.ll, uses the commands env, grep, and awk to generate a data file, in this case an Emacs Lisp program, which is saved as llenv.el (remember that programs are data, and, in the case of Lisp in particular, there is no semantic difference between programs and data: both are stored in the same data section of a Lisp process, and both can be modified dynamically during program execution). Once awk exits, the script checks if the data file is there (an error may have occurred while executing awk, for example). It also checks if the second LoadLeveler script can be found in its working directory. If both files are present, the second script is submitted with the llsubmit command.

# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = env-1.out
# @ error = env-1.err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
env | grep LOADL | \
awk ' BEGIN { 
         { printf "(defun llenv ()\n" } 
	 { printf "   (princ \"LoadLeveler variables:\\n\") " }
      }
      { printf "   (princ \"\t%s\\n\")\n", $0 } 
      END { print ")"} ' > llenv.el
if [ -f llenv.el -a -f env-2.ll ]
then
   llsubmit env-2.ll
fi

The second script is called env-2.ll. First it checks if the file llenv.el exists. Even though we have already checked that within env-1.ll, here we do so again, because the scripts are separate, and there is always a possibility that env-2.ll may have been submitted without running env-1.ll first. If the file exists, we run emacs on it; if it doesn't, we flag an error and exit. The data file itself, llenv.el, is removed after emacs has had its way with it.

# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = env-2.out
# @ error = env-2.err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
if [ -f llenv.el ]
then
   emacs -batch -l llenv.el -f llenv > llenv.out
else
   echo Error: env-2.ll job: llenv.el not found
   exit 1
fi
rm llenv.el

It is easy to restructure the two scripts above into one script, which first performs one task, then resubmits itself, and on the second invocation performs the second task.

In order to do that, the script must be able to find out on its own, whether its current instantiation is the first or the second one. If you have a creepy feeling now that we are getting close to talking about reincarnation, well, yes, you're quite right. That's exactly what we're talking about! How can a process know that it already lived before?

The answer is: by inspecting its environment and finding a particular variable set. The variable would be set during the first instantiation of the LoadLeveler job. It would not exist at all outside of those LoadLeveler jobs, i.e., the user should make sure that it is unset in the user's normal environment.

Here's the script:

# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = $(job_name).out
# @ error = $(job_name).err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
if [ -z "$ENV_SECOND_SUBMISSION" ]
then
   env | grep LOADL | \
   awk ' BEGIN { 
            { printf "(defun llenv ()\n" } 
	    { printf "   (princ \"LoadLeveler variables:\\n\") " }
         }
         { printf "   (princ \"\t%s\\n\")\n", $0 } 
         END { print ")"} ' > llenv.el
   if [ $? -eq 0 ]
   then
      export ENV_SECOND_SUBMISSION="yes"
      llsubmit $LOADL_STEP_COMMAND
   else
      echo Error: problem executing awk
      exit 1
   fi
else
   emacs -batch -l llenv.el -f llenv > llenv.out
   rm llenv.el
fi

The script works as follows. The first step is to check whether the environmental variable ENV_SECOND_SUBMISSION has been set to something. If not, this instantiation of the job has no ancestor. In this case the script calls env, grep, and awk to create the data file llenv.el. After awk exits we inspect its exit status, $?, and only if it is 0 do we define and export the new environmental variable ENV_SECOND_SUBMISSION and resubmit the script, which is what llsubmit $LOADL_STEP_COMMAND does, because $LOADL_STEP_COMMAND evaluates to the name of the script itself. The variable ENV_SECOND_SUBMISSION will be visible in the second instantiation of the job because of the LoadLeveler #@environment=COPY_ALL directive.

If the environmental variable ENV_SECOND_SUBMISSION is found to have been set to a non-empty string, the second clause of the if statement is executed. Within that clause we invoke emacs on the llenv.el file. The file is removed after emacs exits.

Observe that the #@output and #@error directives have been defined in terms of $(job_name) this time. Each instantiation of the script will have a different $(job_name), so that the output and error files for the second instantiation of the job will not overwrite output and error files written by the first instantiation of the job. That is important in case any execution problems arise.

Running very long jobs in a checkpoint and resubmit loop

A similar mechanism can be used to construct an automatically resubmitting job, which will run for its allotted time, save (checkpoint) its state, and resubmit itself, until the whole lengthy computation, which may take many CPU days and many resubmissions, is finished. See the article ``How to Time, Save, and Resubmit your LoadLeveler Jobs'' for a detailed discussion of how to do that and for examples in C, Fortran-90, and Common Lisp.
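The details are in that article; purely as an illustration, the skeleton of such a job might look as follows. This is a sketch only: the application my_app, its checkpoint file my_app.chk, and the sentinel file FINISHED that marks completion are all made-up names, and a real script would add error checking.

# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = $(job_name).out
# @ error = $(job_name).err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ queue
# Run one leg of the computation; the application is assumed to
# checkpoint itself to my_app.chk and to stop before the class limit.
my_app -restart my_app.chk
# If the computation is not complete yet (no FINISHED file),
# resubmit this very script for the next leg.
if [ ! -f FINISHED ]
then
   llsubmit $LOADL_STEP_COMMAND
fi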

Using LoadLeveler steps

If you want to execute a number of very simple tasks in a sequence of LoadLeveler steps, tasks which do not involve much, if any, shell scripting, you may prefer to use LoadLeveler's own multiple job steps facility. That facility is a little bit tricky, and, in particular, you should not try to mix LoadLeveler steps with your own self-submitting shell scripts, because that may easily lead to confusion. Remember, in particular, that if you do not use the LoadLeveler keyword #@executable, then, according to LoadLeveler's semantics, the LoadLeveler script itself becomes the executable. When the script is passed to, say, ksh for execution, all LoadLeveler keywords are simply ignored (they are shell comments), and the whole script is executed in one go, even if you have separated portions of the script with multiple #@queue directives.

Consider the following LoadLeveler script:

#
# Common definitions for all three steps
#
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = $(job_name).$(step_name).out
# @ error = $(job_name).$(step_name).err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ job_name = hello
#
# The first step: compile the program.
#
# @ step_name = compile
# @ executable = /opt/gnu/bin/gcc
# @ arguments = -o hello hello.c
# @ queue
#
# The second step: run the program if the compilation was successful.
#
# @ step_name = run
# @ dependency = compile == 0 
# @ executable = /opt/gnu/bin/bash
# @ arguments = -c "exec hello"
# @ queue
#
# The third step: remove the binary if the run was successful.
#
# @ step_name = clean
# @ dependency = run == 0
# @ executable = /usr/bin/rm
# @ arguments = -e hello
# @ queue
When this script is submitted to LoadLeveler, three jobs will be placed in the queue. Initially two of those jobs will wait until the first job finishes execution. Then the second job will commence execution while the third continues waiting. Finally, the third job will run. I should add that the second and the third jobs will run only if their direct ancestor has exited without any problems, leaving an exit status of 0 behind.

The script is conceptually divided into four chunks.

The first chunk is a preamble with definitions common to all three job steps.

The second chunk describes the first step: it invokes the GNU C compiler on a C program hello.c and, if the compilation is successful, generates a binary hello.
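The contents of hello.c do not matter for the example; any small program that exits with status 0 will do, for instance something like:

#include <stdio.h>

int main()
{
        printf("hello world\n");
        return 0;   /* exit status 0 allows the dependent steps to run */
}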

The third chunk describes the second step: it will run only if the first step has left exit status 0 behind. That's what the directive

# @ dependency = compile == 0
is about. Observe a small complication. Instead of defining
# @ executable = hello
I have defined
# @ executable = /opt/gnu/bin/bash
# @ arguments = -c "exec hello"
The reason for this is that when the script is originally submitted to LoadLeveler, the file hello doesn't exist yet. If I defined #@executable = hello here, LoadLeveler would refuse the job and flag an error: all executables specified with the #@executable keyword must exist at the time the LoadLeveler script is submitted. The remedy is to specify my login shell as the executable instead, and then substitute (with exec) the shell with the binary produced in the first step.

The fourth chunk describes the third step: it will run only if the second step has left exit status 0 behind. That's what the directive

# @ dependency = run == 0
achieves. It is your responsibility, as a programmer, to ensure that this is indeed the case when your program exits cleanly.

This step removes the binary generated by the first step. The command rm is invoked with the -e option, which will leave a trace in the hello.clean.err file:

rm: Removing hello

Can the same be achieved with shell scripting? Although I have warned you about possible pitfalls when mixing scripting and LoadLeveler steps, it is OK to do so, as long as your script does not attempt to resubmit itself. You might even consider the latter, but in that case you must carefully scrutinise the logic of both the shell script and the overlaying LoadLeveler script. Things can easily become convoluted, though not necessarily incorrect. Also, you should remember that the first occurrence of the keyword #@executable will override the shell script for all subsequent steps. If a shell script is present in the LoadLeveler command file, all steps defined before the first occurrence of the keyword #@executable will see the same script. Consequently, the script itself must be able to recognise which particular step is being executed during its instantiation and differentiate its actions accordingly. That information can be obtained from the environmental variable LOADL_STEP_NAME.

Here is an example of a 3-step LoadLeveler job, equivalent to the one discussed above, in which the actions are specified entirely using a shell script rather than three different #@executables.

# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ output = $(job_name).$(step_name).out
# @ error = $(job_name).$(step_name).err
# @ job_type = serial
# @ class = half_hour
# @ notification = always
# @ environment = COPY_ALL
# @ job_name = hello
#
# @ step_name = compile
# @ queue
#
# @ step_name = run
# @ dependency = compile == 0 
# @ queue
#
# @ step_name = clean
# @ dependency = run == 0
# @ queue
#
echo step: $LOADL_STEP_NAME
case $LOADL_STEP_NAME in
   compile ) 
      gcc -v -o hello hello.c 2>&1 ;;
   run ) 
      hello ;;
   clean ) 
      rm -e hello 2>&1 ;;
esac


How to submit a parallel job

In the remaining part of this article I will briefly discuss how to run parallel jobs under LoadLeveler.

Because the main purpose of our computer system is to support parallel jobs, especially MPI and HPF jobs, this issue is also discussed in separate articles, and some of the material presented in those articles is quoted here for completeness.

From the point of view of LoadLeveler there are three basic classes of parallel jobs: POE jobs, PVM3.3 jobs, and other parallel jobs. Of these, LoadLeveler knows best how to run POE jobs - these can be MPI, MPL, HPF, PVMe, and Linda jobs. They are all based on the concept of a static processor pool, i.e., processors are assigned to the job at the beginning of its execution, and no additional processors can be grabbed by the job while it runs. Consequently, certain PVM concepts, such as dynamic allocation and de-allocation of parallel machines, are not supported. This applies also to the next class of parallel jobs that LoadLeveler knows about: PVM3.3 jobs. LoadLeveler knows how to start PVM3.3 daemons on the allocated machines, and how to invoke a PVM3.3 application.

LoadLeveler has no idea how to run other parallel jobs, e.g., network-Linda jobs (parallel Gaussian falls into this category), ISIS jobs, LAM jobs, etc. But sometimes LoadLeveler can be fooled into thinking that these are either POE or PVM3.3 jobs, and, at the very least, it will produce a list of allocated nodes, over which user programs or daemons can then distribute themselves.

How to submit a POE job

POE jobs are the easiest to run under LoadLeveler. The same #@executable=/usr/bin/poe is always involved, regardless of whether the job is an MPI, MPL, HPF, or Linda job. Only for PVMe jobs is the executable different: #@executable=/usr/lpp/pvme/bin/pvmd3e. The only difference between, say, an MPI and an HPF job is the compiler that was used to produce the POE binary.

At the very least POE jobs on our system should be specified by the following LoadLeveler keywords:

# @ environment
there are a number of POE-related environmental variables which specify, e.g., the dynamic libraries to be used for the run
# @ min_processors
# @ max_processors
# @ requirements
switch interface specifications go here, e.g., Adapter == hps_ip, which means ``use IP protocol over the switch''.
# @ job_type = parallel

How to submit a POE/MPI job

Here is an example of an MPI version of "hello world". The program itself looks as follows:

#include <stdio.h>
#include <mpi.h>

main(argc, argv)
int argc;
char *argv[];
{
        char name[BUFSIZ];
        int length;

        MPI_Init(&argc, &argv);
        MPI_Get_processor_name(name, &length);
        printf("%s: hello world\n", name);
        MPI_Finalize();
}
Compile it with the command
$ mpcc mpi-hello.c -o mpi-hello
and run it by submitting the following LoadLeveler script:
# @ initialdir = /home/qpsf/gustav/src/try
# @ job_type = parallel
# @ environment = COPY_ALL; MP_EUILIB=ip
# @ requirements = (Adapter == "hps_ip")
# @ min_processors = 4
# @ max_processors = 8
# @ output = mpi-hello.out
# @ error = mpi-hello.err
# @ executable = /usr/bin/poe
# @ arguments = mpi-hello
# @ class = half_hour
# @ notification = always
# @ queue

This is what the file mpi-hello.out may look like after the run is finished:

s1n05: hello world
s1n04: hello world
s1n10: hello world
s1n09: hello world
s1n08: hello world
s1n06: hello world
s1n07: hello world
s1n03: hello world
The file mpi-hello.err will contain a message from poe:
WARNING: 0031-408  8 nodes allocated by LoadLeveler, continuing...

It is possible to increase markedly the level of poe verbosity by adding the environmental variable MP_INFOLEVEL to the list specified by the #@environment keyword:

# @ environment=COPY_ALL; MP_EUILIB=ip; MP_INFOLEVEL=6
Numerous diagnostic messages will then be written to mpi-hello.err (about 20kB in the case of this little program, sic!).

If you need to perform certain manipulations before and after running the main executable you can use scripting the same way as has already been discussed for sequential programs. For example, you could run the MPI "hello world" example as follows:

# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ job_type = parallel
# @ environment = COPY_ALL; MP_EUILIB=ip; MP_INFOLEVEL=3
# @ requirements = (Adapter == "hps_ip")
# @ min_processors = 4
# @ max_processors = 8
# @ output = mpi-hello.out
# @ error = mpi-hello.err
# @ class = half_hour
# @ notification = always
# @ queue
poe mpi-hello

The next example shows how you can have even more control over the way your POE job is run under LoadLeveler:

# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/src/try
# @ job_type = parallel
# @ environment = COPY_ALL; 
# @ requirements = (Adapter == "hps_ip")
# @ min_processors = 4
# @ max_processors = 8
# @ output = mpi-hello.out
# @ error = mpi-hello.err
# @ class = half_hour
# @ notification = always
# @ queue
> host.list.$LOADL_STEP_ID
NPROC=0
for node in $LOADL_PROCESSOR_LIST
do
   echo $node >> host.list.$LOADL_STEP_ID
   NPROC=`expr $NPROC + 1`
done
#
export MP_HOSTFILE=host.list.$LOADL_STEP_ID
export MP_PROCS=$NPROC
export MP_EUILIB=ip
export MP_EUIDEVICE=css0
export MP_INFOLEVEL=3
#
poe mpi-hello 
#
rm $MP_HOSTFILE

This time I construct the POE host file dynamically and pass it to POE via the MP_HOSTFILE environmental variable. At the same time I count the number of allocated nodes. I have requested that number to be between 4 and 8, but the exact number will be known only when the job starts. That number is passed to POE via the MP_PROCS environmental variable.

It is instructive to have a look at the mpi-hello.err file produced by the run. The file may begin with something like:

INFO: DEBUG_LEVEL changed from 0 to 1
D1<L1>: Open of file host.list.s1n01.qpsf.edu.au.17791.0 successful
D1<L1>: mp_euilib = ip
D1<L1>: node allocation strategy = 1
INFO: 0031-119  Host s1n04.qpsf.edu.au allocated for task 0
INFO: 0031-119  Host s1n05.qpsf.edu.au allocated for task 1
INFO: 0031-119  Host s1n07.qpsf.edu.au allocated for task 2
INFO: 0031-119  Host s1n10.qpsf.edu.au allocated for task 3
INFO: 0031-119  Host s1n03.qpsf.edu.au allocated for task 4
INFO: 0031-119  Host s1n09.qpsf.edu.au allocated for task 5
INFO: 0031-119  Host s1n08.qpsf.edu.au allocated for task 6
INFO: 0031-119  Host s1n06.qpsf.edu.au allocated for task 7
which shows that host names were obtained from the dynamically generated host file, and it also shows which task runs on which host.

Throughout the file you will find 8 occurrences of a line:

INFO: 0031-724  Executing program: <mpi-hello>
which shows that all POE tasks successfully located and loaded the executable mpi-hello. You will also find there lines such as
D1<L1>: init_data for task 1: <203.2.136.7:4320>
D1<L1>: init_data for task 2: <203.2.136.9:4688>
D1<L1>: init_data for task 4: <203.2.136.5:4940>
D1<L1>: init_data for task 7: <203.2.136.8:4066>
The IP numbers, which appear in the brackets, e.g., 203.2.136.7 correspond to the HPS interfaces, which means that all communication takes place over the switch, even though the names specified in the host file correspond to ethernet interfaces, sic!

Then you can see lines such as

INFO: 0031-656  I/O file STDOUT closed by task 6
INFO: 0031-656  I/O file STDOUT closed by task 2
INFO: 0031-656  I/O file STDOUT closed by task 4
...
and
INFO: 0031-251  task 5 exited: rc=0
INFO: 0031-251  task 6 exited: rc=0
INFO: 0031-251  task 7 exited: rc=0
INFO: 0031-251  task 1 exited: rc=0
...
The file ends with
D1<L1>: All remote tasks have exited: maxx_errcode = 0
INFO: 0031-639  Exit status from pm_respond = 0
D1<L1>: Maximum return code from user = 0

These lines show the return codes for the tasks. If something goes wrong with any of them, you can locate the offending process by inspecting the mpi-hello.err file.
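For example, a quick way to spot a task that exited with a non-zero return code is to filter the messages shown above:

$ grep "exited: rc=" mpi-hello.err | grep -v "rc=0"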

How to submit a POE/HPF job

High Performance Fortran jobs are POE jobs, and as such they are run under LoadLeveler or interactively in the same way as POE/MPI jobs.

The details are discussed in the article ``High Performance Fortran''.

In short, the program is called jacobi.f. It is compiled as follows:

$ xlhpf90 -o jacobi jacobi.f
and the LoadLeveler script used to run it looks as follows:
# @ initialdir = /home/qpsf/gustav/HPF/aix
# @ job_type = parallel
# @ environment = COPY_ALL; MP_EUILIB=ip; MP_INFOLEVEL=6
# @ requirements = (Adapter == "hps_ip")
# @ min_processors = 4
# @ max_processors = 8
# @ output = jacobi.out
# @ error = jacobi.err
# @ executable = /usr/bin/poe
# @ arguments = jacobi
# @ class = half_hour
# @ notification = always
# @ queue
As you can see, there is nothing really HPF-specific here. Running the job with MP_INFOLEVEL=6 will show details of the communication taking place between the participating processes in the jacobi.err file. For example:
D3<L4>: Message type 21 from source 0
D3<L4>: Message type 44 from source 5
D3<L4>: Message type 15 from source 0

How to submit a POE/Linda job

You will find Linda examples in the directory /opt/linda/common. The subdirectory cl-examples contains C and C++ examples, and the subdirectory fl-examples contains Fortran-77 examples.

There is a ping.cl example in cl-examples. That program creates two concurrent processes, which play a sort of ping-pong game. If you copy that file to your $HOME directory, and if you add /opt/linda/sp2dm-4.1/bin to your command search PATH, you can compile and link that program with the command:
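For bash users the PATH can be extended with, e.g.:

$ export PATH=/opt/linda/sp2dm-4.1/bin:$PATH

and for csh or tcsh users with:

% set path = ( /opt/linda/sp2dm-4.1/bin $path )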

gustav@s1n01:~/cl-examples 367 $ clc -g -o ping ping.cl
clc (V4.0.1 Distributed Memory)
gustav@s1n01:~/cl-examples 368 $ 

To run that program, simply prepare the following file and save it as, say, ping.ll:

# @ initialdir = /home/qpsf/gustav/cl-examples
# @ job_type = parallel
# @ environment = COPY_ALL; MP_EUILIB=ip; MP_INFOLEVEL=0
# @ requirements = (Adapter == "hps_ip")
# @ min_processors = 4
# @ max_processors = 8
# @ output = ping.out
# @ error = ping.err
# @ executable = /usr/bin/poe
# @ arguments = ./ping 1000
# @ class = half_hour
# @ notification = always
# @ queue
As you can see, as in our HPF example, there is nothing here that is Linda-specific. As far as LoadLeveler, AIX, and the SP are concerned, this is just a POE job.

Submit that file to LoadLeveler with the command

gustav@s1n01:~/cl-examples 368 $ llsubmit ping.ll
submit: The job "s1n01.17931" has been submitted.
gustav@s1n01:~/cl-examples 369 $ 

After the job exits, you will see the following on ping.out:

sp2dm Linda Runtime System: version Tue Jun 11 13:24:20 EDT 1996
 Note            Since Start    Since Last
timer started.         0.000         0.000
evals done.            0.003         0.003
done.                  1.319         1.316

Here is another example, which requires a more elaborate LoadLeveler script. The program below is a sort of Linda distributed "Hello world", similar to our MPI example:

#include <stdio.h>

int real_main(argc, argv)
int argc;
char *argv[];
{
   int nworker, j, hello();
   char name[BUFSIZ];

   nworker=atoi(argv[1]);
   gethostname(name, BUFSIZ);
   printf("%s: the master process.\n", name);
   printf("%s: spawning function \"hello\" on %d worker processes.\n", 
          name, nworker);
   for (j=0; j < nworker; j++) eval("worker", hello(j));
   printf("%s: receiving %d \"dones\" from worker processes.\n",
          name, nworker);
   for (j=0; j < nworker; j++) in("done");
   printf("%s: finished.\n", name);
   return(0);
}

int hello(i)
int i;
{
   char name[BUFSIZ];

   gethostname(name, BUFSIZ);
   printf("\tHello from number %d running on %s.\n", i, name);
   out("done");
   return(0);
}

When this program is invoked it takes the number of worker processes from the command line. If we specify different values for #@min_processors and #@max_processors, we will not know how many processors we are going to get until the LoadLeveler script actually runs. So the trick is to inform the Linda program that it should spawn, say, 7 workers if LoadLeveler allocates 8 processors to this run (the 8th process is the master process). Below is a script which accomplishes that. Observe that this time we manipulate the POE environment ourselves, taking from LoadLeveler only the list of allocated nodes ($LOADL_PROCESSOR_LIST) and the step id ($LOADL_STEP_ID). The latter is used to generate a unique file name for the host list file.

# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/cl-examples
# @ job_type = parallel
# @ environment = COPY_ALL; 
# @ requirements = (Adapter == "hps_ip")
# @ min_processors = 4
# @ max_processors = 8
# @ output = hello.out
# @ error = hello.err
# @ class = half_hour
# @ notification = always
# @ queue
> host.list.$LOADL_STEP_ID
NPROC=0
for node in $LOADL_PROCESSOR_LIST
do
   echo $node >> host.list.$LOADL_STEP_ID
   NPROC=`expr $NPROC + 1`
done
#
export MP_HOSTFILE=host.list.$LOADL_STEP_ID
export MP_PROCS=$NPROC
export MP_EUILIB=ip
export MP_EUIDEVICE=css0
export MP_INFOLEVEL=3
#
poe ./hello `expr $NPROC - 1`
#
rm $MP_HOSTFILE

The hello.out file generated by this example may look something like this:

gustav@s1n01:~/cl-examples 482 $ cat hello.out
sp2dm Linda Runtime System: version Tue Jun 11 13:24:20 EDT 1996
s1n07: the master process.
        Hello from number 0 running on s1n03.
        Hello from number 1 running on s1n10.
        Hello from number 2 running on s1n06.
        Hello from number 3 running on s1n08.
        Hello from number 5 running on s1n04.
        Hello from number 6 running on s1n09.
        Hello from number 4 running on s1n05.
s1n07: spawning function "hello" on 7 worker processes.
s1n07: receiving 7 "dones" from worker processes.
s1n07: finished.
gustav@s1n01:~/cl-examples 483 $ 
Observe that messages arriving from various processes are not necessarily registered by the system and displayed in the order in which you might think they should have arrived. For example, the message about spawning function "hello" was chronologically produced before the workers had a chance to generate their Hello from number messages, yet in the listing above it appears after those Hello messages.

How to submit a PVMe job

PVMe is an implementation of PVM designed for use on an IBM SP system with the High Performance Switch. Although compatible with PVM, PVMe has a different internal structure, not least because it runs on a homogeneous platform, and because it makes use of special primitives for task synchronisation.

There is an example PVM benchmark code in /usr/lpp/pvme/sample. In order to compile and run that benchmark, do as follows. First, create the directories pvm3/src and pvm3/bin/RS6K in your $HOME (this is the standard PVM layout that the make run shown below expects).
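One way to create them (a sketch; mkdir -p creates the intermediate directories as needed):

$ mkdir -p $HOME/pvm3/src $HOME/pvm3/bin/RS6K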

Now, copy the whole directory /usr/lpp/pvme/sample to your $HOME/pvm3/src, go to $HOME/pvm3/src/sample, and edit the file parabench, replacing
       PARAMETER(IPROC=1)
with, say,
       PARAMETER(IPROC=5)
(this means that PVMe will attempt to spawn 5 processes when this program is run). Then type:
gustav@s1n01:~/pvm3/src/sample 823 $ make
xlf -c -O hostbenc.f
** hostbenc   === End of Compilation 1 ===
1501-510  Compilation successful for file hostbenc.f.
xlf -L/usr/lpp/ssp/css/libus -lmpci hostbenc.o -lpvm3 -lfpvm3 \
    -o /home/qpsf/gustav/pvm3/bin/RS6K/hostbenc \
    -bI:/usr/lib/pvm3e.exp
xlf -c -O nodebenc.f
** nodebenc   === End of Compilation 1 ===
1501-510  Compilation successful for file nodebenc.f.
xlf -L/usr/lpp/ssp/css/libus -lmpci nodebenc.o -lpvm3 -lfpvm3 \
    -o /home/qpsf/gustav/pvm3/bin/RS6K/nodebenc \
    -bI:/usr/lib/pvm3e.exp
gustav@s1n01:~/pvm3/src/sample 824 $ 

Because the binaries have been linked with the libus libraries instead of the libip libraries, they will have to be run in user mode, i.e., only this one parallel job will be allowed to run on the nodes allocated by LoadLeveler. So we will have to run it on pool 2, submitting the job to a dedicated class.

My LoadLeveler script for this job looks as follows:

# @ shell = /opt/gnu/bin/bash
# @ initialdir = /home/qpsf/gustav/pvm3/bin/RS6K
# @ job_type = parallel
# @ environment = COPY_ALL; 
# @ requirements = (Adapter == "hps_user")
# @ min_processors = 6
# @ max_processors = 6
# @ output = hostbenc.out
# @ error = hostbenc.err
# @ class = half_hour_dedicated
# @ notification = always
# @ queue
ln -s /usr/bin/rsh $HOME/bin/rsh
hash -r
/usr/lpp/pvme/bin/pvmd3e \
   -exec /home/qpsf/gustav/pvm3/bin/RS6K/hostbenc << EOF
40000
5
3
EOF
rm $HOME/bin/rsh
Observe that since I will be spawning 5 PVMe processes within this job, I need 6 processors altogether: the 6th process is the master. Also observe that before invoking pvmd3e I have linked /usr/bin/rsh into my $HOME/bin. That is because I use Kerberised rsh in my everyday life, but LoadLeveler will not pass on my Kerberos credentials. PVMe uses rsh to spawn processes; if I did not replace the Kerberised rsh with /usr/bin/rsh, that procedure would generate errors and unnecessary delays. After the PVMe job exits I remove the link, which is no longer needed.

When the job has finished, the file hostbenc.out will contain the benchmark results, which may look like this:

...
Local task:  TIME:   21017.9686546325684  MICROSECONDS .
Local task:  BANDWIDTH  7.61253393347188378  MB/s.
Local task:  BANDWIDTH PUT 12782.6407296316975  MB/s.
Local task:  BANDWIDTH SND 22.3030837436868055  MB/s.
...
As to the correctness of these results, I suspect a bug in the way the first BANDWIDTH is calculated (there is an underlying assumption that there are only two communicating processes, but we have spawned five!), but that's a different matter altogether. The difference between the PUT and SND bandwidths illustrates how slow network communication is, even over the High Performance Switch, in comparison with memory-to-memory copy. You should always keep that in mind when developing distributed computer programs.

How to submit a parallel Gaussian job

Parallel Gaussian is somewhat similar to PVMe, because it is implemented on top of network-Linda, as opposed to the POE-Linda discussed above. Network-Linda processes are spawned by /usr/bin/rsh. There is an additional complication caused by the fact that the list of allocated nodes returned by LoadLeveler in $LOADL_PROCESSOR_LIST corresponds to ethernet interfaces. Whereas POE and PVMe translate those names automatically to HPS interfaces, network-Linda does not, so we have to do the translation ourselves before starting the parallel Gaussian process.
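How the translation is done depends on the interface naming convention on your system. Purely as an illustration (assuming, hypothetically, that the switch interface of a node named sXnYY is listed in /etc/hosts as sXnYY-hps), a script fragment along these lines could build a host list with switch names:

> linda.hosts.$LOADL_STEP_ID
for node in $LOADL_PROCESSOR_LIST
do
   # strip the domain part and append the assumed switch suffix
   short=`echo $node | sed 's/\..*//'`
   echo ${short}-hps >> linda.hosts.$LOADL_STEP_ID
done

Check the actual interface names on the system before relying on a convention like this.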

The article ``The Parallel Gaussian'' discusses in detail how to run parallel Gaussian jobs under LoadLeveler on our system. Please refer to that article for more information.


For help and programming or academic assistance e-mail gustav@indiana.edu
Please e-mail any feedback related to this document to webmaster@beige.ucs.indiana.edu
