In this section I shall demonstrate how our toy application can be run under the LoadLeveler, and how you can use its various features to automatically keep resubmitting the job until the whole computational task is finished.
What makes it particularly easy is the LoadLeveler's #@environment=COPY_ALL statement, which transfers
all currently defined environment variables to the submitted job. That way we can define, say,
RSAVE_RESTART in the script, after the first, initialising run of the application, and rest assured that
when the job is resubmitted, it will already read the data from the restart file.
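The inheritance that COPY_ALL relies on can be tried out without LoadLeveler at all. The following is only an analogy: a child bash process stands in for the resubmitted job, and child_view is a hypothetical helper name introduced for this illustration. It shows that a variable reaches the next process only once it has been exported.

```shell
#!/bin/bash
# Analogy only: a child bash process plays the role of the resubmitted job.
# child_view is a hypothetical helper, not part of LoadLeveler.

# Print the value of RSAVE_RESTART as seen by a child process.
child_view() {
    bash -c 'echo "${RSAVE_RESTART:-unset}"'
}

unset RSAVE_RESTART
child_view              # prints "unset": the variable is not defined yet

RSAVE_RESTART=yes       # set in this shell only, not exported
child_view              # still prints "unset"

export RSAVE_RESTART    # now part of the environment that is handed on
child_view              # prints "yes"
```

This is why the script exports every RSAVE_ variable it creates: an unexported shell variable would never reach llsubmit, and so COPY_ALL would have nothing to copy.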
The LoadLeveler script begins by running program ./rts: that is our application. The output is saved
on rts.log:
./rts > rts.log
Both C and Fortran examples are invoked in the same way.
After the program exits, the script performs a number of quite interesting manipulations. First of all, it
checks whether the environment variable RSAVE_STEP exists. That variable is used to number our LoadLeveler
runs. It is quite like LoadLeveler's variable $(stepid), with the difference that here we do it all
ourselves. If the variable exists, this particular run was already a resubmission. In that
case the value of RSAVE_STEP is incremented and the old restart file, say, rts.dat.old, is renamed to
something like rts.dat.3, where 3 is the RSAVE_STEP number. That way we keep a log of the
whole computation. In a more complex application, the rts.dat files could contain images or
three-dimensional data sets, which, if saved, could be used to produce an animation or a CAVE display.
If the variable RSAVE_STEP does not exist, it means that this is the initialising run. In that case the
variable is created and assigned number 0. Because we export it, it will become available to the next
instantiation of the job.
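The numbering rule described above can be isolated in a few lines. This is a minimal sketch, not the script's own code: next_step is a hypothetical function name, and it uses bash arithmetic expansion in place of the expr call found in the script.

```shell
#!/bin/bash
# next_step PREVIOUS
# Print the step number for this run: 0 if PREVIOUS is empty (the
# initialising run), PREVIOUS + 1 otherwise (a resubmission).
# Hypothetical helper; the script itself uses expr for the increment.
next_step() {
    if [ -n "$1" ]; then
        echo $(( $1 + 1 ))
    else
        echo 0
    fi
}

next_step ""    # initialising run: prints 0
next_step 2     # third resubmission: prints 3
```

In the LoadLeveler script the result would be assigned with export, for example export RSAVE_STEP=$(next_step "$RSAVE_STEP"), so that the next instantiation of the job inherits it.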
The log file, rts.log, is also saved under a name like rts.log.3, where 3 is the RSAVE_STEP
number. Observe that rts.log.3 corresponds to the run that used rts.dat.3 as its restart file.
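The archiving step can be tried in a scratch directory to see which files end up with which suffix. The sketch below assumes a hypothetical helper archive_run and made-up file contents; it mirrors the renaming and copying described above but is not the script's own code.

```shell
#!/bin/bash
# archive_run CHECKFILE LOGFILE STEP
# Rename CHECKFILE.old (if present) to CHECKFILE.STEP and keep a copy
# of LOGFILE as LOGFILE.STEP. Hypothetical helper for illustration.
archive_run() {
    local checkfile=$1 logfile=$2 step=$3
    if [ -f "$checkfile.old" ]; then
        mv "$checkfile.old" "$checkfile.$step"
    fi
    cp "$logfile" "$logfile.$step"
}

# Demonstration in a temporary directory with made-up contents.
demo=$(mktemp -d)
echo 6 > "$demo/rts.dat.old"
echo CONTINUE > "$demo/rts.log"
archive_run "$demo/rts.dat" "$demo/rts.log" 3
ls "$demo"      # lists rts.dat.3, rts.log and rts.log.3
rm -rf "$demo"
```

Because both files receive the same step number, the restart file a run started from and the log that run produced can always be paired up afterwards.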
After these manipulations we inspect the log file itself and check if it contains the word CONTINUE. If it
does, we check if the variable RSAVE_RESTART exists. If it doesn't, this was the first,
initialising run, so we create that variable. Once created, it will become available to the next
instantiation of the job via the #@environment=COPY_ALL mechanism. Either way the job is
resubmitted with the command
llsubmit $LOADL_STEP_COMMAND
where $LOADL_STEP_COMMAND evaluates to the name of the LoadLeveler script itself.
If the word CONTINUE has not been found in the log file, then we check if the log file contains the word
FINISHED. If the job is FINISHED it is not resubmitted. Instead a mail message is sent to whoever
submitted the job in the first place ($LOADL_STEP_OWNER), informing the addressee that the job has
been completed.
If neither the word CONTINUE nor the word FINISHED has been found in the log file, it means that an
error condition must have occurred and the job exited mid-way. In that case, the job is not
resubmitted and a mail message reporting the error is sent to $LOADL_STEP_OWNER.
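The three-way decision described above can be sketched as a small function that can be tested away from the cluster. classify_log is a hypothetical name, and the sketch only echoes what it would do instead of calling llsubmit or mailx. It also uses grep -q, which suppresses the matched line, whereas the script's plain grep lets the matched word appear in the job's output file.

```shell
#!/bin/bash
# classify_log LOGFILE
# Print CONTINUE, FINISHED or ERROR depending on which keyword, if any,
# appears in LOGFILE. CONTINUE is checked first, as in the script.
# Hypothetical helper for illustration.
classify_log() {
    if grep -q CONTINUE "$1"; then
        echo CONTINUE
    elif grep -q FINISHED "$1"; then
        echo FINISHED
    else
        echo ERROR
    fi
}

# Demonstration with a made-up log; actions are only echoed.
log=$(mktemp)
printf 'n = 6\nCONTINUE\n' > "$log"
case $(classify_log "$log") in
    CONTINUE) echo "would resubmit the job" ;;
    FINISHED) echo "would mail: job has FINISHED" ;;
    ERROR)    echo "would mail: error exit, check the log file" ;;
esac
rm -f "$log"
```

Treating everything that is neither CONTINUE nor FINISHED as an error means that a crash, a killed process, or an empty log all stop the resubmission chain rather than looping forever.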
Here is the whole LoadLeveler script in full glory:
# @ shell = /afs/ovpit.indiana.edu/@sys/gnu/bin/bash
# @ environment = COPY_ALL
# @ job_name = rts
# @ output = $(job_name).$(jobid).out
# @ error = $(job_name).$(jobid).err
# @ class = test
# @ notification = never
# @ queue
#
# Execute this step.
#
./rts > rts.log
#
# If there is $RSAVE_CHECKFILE.old file then
# replace the suffix ".old" with a step number.
#
if [ -n "${RSAVE_STEP}" ]
then
    export RSAVE_STEP=`expr $RSAVE_STEP + 1`
    if [ -n "${RSAVE_CHECKFILE}" ]
    then
        if [ -f $RSAVE_CHECKFILE.old ]
        then
            mv $RSAVE_CHECKFILE.old $RSAVE_CHECKFILE.$RSAVE_STEP
        fi
    fi
else
    export RSAVE_STEP=0
fi
#
# also save the log of this run
#
cp rts.log rts.log.$RSAVE_STEP
#
# Check if the job is finished and if it is not
# resubmit this file
#
if grep CONTINUE rts.log
then
    if [ -z "${RSAVE_RESTART}" ]
    then
        export RSAVE_RESTART=yes
    fi
    llsubmit $LOADL_STEP_COMMAND
elif grep FINISHED rts.log
then
    mailx $LOADL_STEP_OWNER << EOF
Your job rts has FINISHED
EOF
else
    mailx $LOADL_STEP_OWNER << EOF
rts: error exit, check the log file
EOF
fi
Here is how this script is submitted and what happens afterwards.
gustav@sp19:../LoadLeveler 15:09:24 !620 $ env | grep RSAVE
RSAVE_TIME_LIMIT=30
RSAVE_CHECKFILE=rts.dat
gustav@sp19:../LoadLeveler 15:09:35 !621 $ llsubmit rts.ll
submit: The job "sp19.104" has been submitted.
gustav@sp19:../LoadLeveler 15:09:40 !622 $
Observe that only RSAVE_TIME_LIMIT and RSAVE_CHECKFILE have been defined. All other variables
will be defined by the LoadLeveler script as they become needed.
The job runs happily, resubmitting itself every time the program rts exits and producing numerous log
and data files:
gustav@sp19:../LoadLeveler 15:27:09 !695 $ ls rts*
rts          rts.169.err  rts.448.out  rts.dat.10  rts.f       rts.log.5
rts.105.err  rts.169.out  rts.449.err  rts.dat.2   rts.ll      rts.log.6
rts.105.out  rts.445.err  rts.449.out  rts.dat.3   rts.log     rts.log.7
rts.166.err  rts.445.out  rts.98.err   rts.dat.4   rts.log.0   rts.log.8
rts.166.out  rts.446.err  rts.98.out   rts.dat.5   rts.log.1   rts.log.9
rts.167.err  rts.446.out  rts.c        rts.dat.6   rts.log.10
rts.167.out  rts.447.err  rts.cpp      rts.dat.7   rts.log.2
rts.168.err  rts.447.out  rts.dat      rts.dat.8   rts.log.3
rts.168.out  rts.448.err  rts.dat.1    rts.dat.9   rts.log.4
gustav@sp19:../LoadLeveler 15:30:13 !696 $
The rts.dat.* files contain the evolution (or animation) of the system:
gustav@sp19:../LoadLeveler 15:30:13 !696 $ cat `ls -t rts.dat.*`
30
27
24
21
18
15
12
9
6
3
gustav@sp19:../LoadLeveler 15:31:22 !697 $
The rts.log.* files contain the log of the whole computation:
gustav@sp19:../LoadLeveler 15:31:22 !697 $ cat `ls -t rts.log.*`
Time for this job limited to 30 seconds
Restarting the job from rts.dat
n = 30
computing ...
n = 31 time left = 25 seconds
done.
n = 31
Renaming the old restart file to rts.dat.old
Saving data on rts.dat
FINISHED
Time for this job limited to 30 seconds
Restarting the job from rts.dat
n = 27
computing ...
n = 28 time left = 25 seconds
n = 29 time left = 20 seconds
n = 30 time left = 15 seconds
Run out of time, exiting ...
done.
n = 30
Renaming the old restart file to rts.dat.old
Saving data on rts.dat
CONTINUE
...
Time for this job limited to 30 seconds
Restarting the job from rts.dat
n = 3
computing ...
n = 4 time left = 25 seconds
n = 5 time left = 20 seconds
n = 6 time left = 15 seconds
Run out of time, exiting ...
done.
n = 6
Renaming the old restart file to rts.dat.old
Saving data on rts.dat
CONTINUE
Time for this job limited to 30 seconds
Starting a new run
n = 0
computing ...
n = 1 time left = 25 seconds
n = 2 time left = 20 seconds
n = 3 time left = 15 seconds
Run out of time, exiting ...
done.
n = 3
Saving data on rts.dat
CONTINUE
gustav@sp19:../LoadLeveler 15:32:16 !698 $
And the rts.*.out files contain messages from the LoadLeveler script in its various instantiations:
gustav@sp19:../LoadLeveler 15:32:16 !698 $ cat `ls -t rts.*.out`
FINISHED
CONTINUE
submit: The job "sp18.169" has been submitted.
CONTINUE
submit: The job "sp17.449" has been submitted.
CONTINUE
submit: The job "sp18.168" has been submitted.
CONTINUE
submit: The job "sp17.448" has been submitted.
CONTINUE
submit: The job "sp18.167" has been submitted.
CONTINUE
submit: The job "sp17.447" has been submitted.
CONTINUE
submit: The job "sp18.166" has been submitted.
CONTINUE
submit: The job "sp17.446" has been submitted.
CONTINUE
submit: The job "sp21.98" has been submitted.
CONTINUE
submit: The job "sp17.445" has been submitted.
gustav@sp19:../LoadLeveler 15:34:36 !699 $
When the whole job finished, I received the following mail message, sent to me by the LoadLeveler script:
Date: Tue, 26 Jan 1999 15:26:56 -0500
From: Zdzislaw Meglicki <gustav@sp17.ucs.indiana.edu>
Message-Id: <199901262026.PAA18102@sp17.ucs.indiana.edu>
To: gustav@sp17.ucs.indiana.edu
Content-Type: text
Content-Length: 26

Your job rts has FINISHED
If you run a long job, which resubmits itself twice or perhaps only once a day, it is a good idea to change
#@notification = never
to
#@notification = always
so that you can keep an eye on the computation.