In this section I shall demonstrate how our toy application can be run under PBS, and how you can use its various features to keep resubmitting the job automatically until the whole computational task is finished.
The PBS script begins by changing to the GPFS directory where we run rts and then running the program itself. The output is saved on rts.log:
cd /N/gpfs/gustav/rts
./rts > rts.log
After the job exits the script performs a number of manipulations. First of all, it checks if an environment variable
RSAVE_STEP exists. That variable is used to number our PBS
runs. If the variable exists,
it means that this particular run was already a resubmission. In that
case the value of RSAVE_STEP is incremented and the old restart file,
say, rts.dat.old is renamed to something like
rts.dat.3, where 3 is
the RSAVE_STEP number. This way we keep the log of the whole
computation. In a more complex application the rts.dat files could
contain images or three dimensional data sets that, if saved, could
be used to produce an animation or a CAVE display.
If the variable RSAVE_STEP does not exist, it means that this
is the initializing run. In that case the variable is created and
assigned number 0. We will make it available to the next run by the
means of the qsub -v option.
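The numbering rule just described can be tried out in isolation. What follows is only a minimal sketch: the helpers next_step and archive_restart are hypothetical names introduced for illustration and do not appear in the actual PBS script, which performs the same steps inline and renames the restart file only on a resubmission run.

```shell
#!/bin/bash
# Sketch of the step-numbering logic. next_step and archive_restart are
# hypothetical helpers introduced for illustration only.

next_step() {
    # $1: current value of RSAVE_STEP, possibly empty
    if [ -n "$1" ]; then
        echo $(( $1 + 1 ))   # resubmission: increment the step number
    else
        echo 0               # initializing run: start counting from 0
    fi
}

archive_restart() {
    # $1: restart file name (e.g. rts.dat), $2: step number
    local checkfile=$1 step=$2
    if [ -f "$checkfile.old" ]; then
        # keep the old restart file under its step number
        mv "$checkfile.old" "$checkfile.$step"
    fi
}
```

With these definitions, RSAVE_STEP=$(next_step "$RSAVE_STEP") followed by archive_restart rts.dat "$RSAVE_STEP" would turn rts.dat.old into, say, rts.dat.3 on the third resubmission.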
The log file, rts.log, is also saved on something like, say,
rts.log.3, where 3 is the RSAVE_STEP number. Observe
that rts.log.3 corresponds to the run that used
rts.dat.3 as its restart file.
After these manipulations we inspect the log file itself and check if
it contains the word CONTINUE. If it does we resubmit the next
iteration of the job with the command:
ssh bh1 "cd /N/B/gustav/PBS; /usr/pbs/bin/qsub \
  -v RSAVE_TIME_LIMIT=30,RSAVE_CHECKFILE=rts.dat,RSAVE_RESTART=yes,\
RSAVE_STEP=$RSAVE_STEP rts.sh"
We pass
RSAVE_RESTART=yes to the next job on the qsub
command line; this way the next job will know it is a resubmission
job, not an initialization job. We also pass the RSAVE_STEP
to the next job, so that it can enumerate the log and data
files properly.
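The list that follows -v is just a comma-separated string of NAME=value pairs, so it can be assembled and inspected before anything is submitted. Here is a minimal sketch, assuming the same four variables as above; the helper qsub_vars is a hypothetical name, and the command is only printed, so the example runs on a machine without PBS.

```shell
#!/bin/bash
# qsub_vars is a hypothetical helper that assembles the comma-separated
# variable list handed to qsub -v. It only builds the string.
qsub_vars() {
    local step=$1
    local IFS=,              # "${vars[*]}" joins elements with the first IFS character
    local vars=(
        "RSAVE_TIME_LIMIT=30"
        "RSAVE_CHECKFILE=rts.dat"
        "RSAVE_RESTART=yes"
        "RSAVE_STEP=$step"
    )
    echo "${vars[*]}"
}

# Print the command instead of running it (a dry run).
echo "qsub -v $(qsub_vars 3) rts.sh"
```

Note that the list must not contain any blanks; a stray space would end the -v argument prematurely.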
If the word CONTINUE has not been found in the log file, then
we check if the log file contains the word FINISHED. If the job
is FINISHED, it is not resubmitted. Instead, a mail message that
tells about the completion of the computation is sent, in this case,
to me. Observe that the mail message is sent from the
head node, bh1, via ssh.
If neither the word CONTINUE nor the word FINISHED
has been found in
the log file, it means that an error condition must have occurred and
the job exited mid-way. In that case, the job is not resubmitted and a
mail message informing about the error is sent to me.
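The three-way decision described above can be sketched as a small function. This is an illustration only: classify_log is a hypothetical helper, not part of the actual script, and it uses grep -q, which suppresses the matched line, whereas the actual script lets grep print the matched word to the job's standard output.

```shell
#!/bin/bash
# classify_log is a hypothetical helper introduced for illustration.
# It prints "continue", "finished" or "error" for the given log file.
classify_log() {
    local log=$1
    if grep -q CONTINUE "$log"; then
        echo continue        # the job should be resubmitted
    elif grep -q FINISHED "$log"; then
        echo finished        # the whole computation is done
    else
        echo error           # neither word found: the run exited mid-way
    fi
}
```

A script could then branch on the result with a case statement, for example case $(classify_log rts.log) in continue) ... ;; esac.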
Here is the whole PBS script in full glory:
[gustav@bh1 PBS]$ cat rts.sh
#PBS -S /bin/bash
#PBS -q bg
#PBS -m a
#PBS -M gustav@indiana.edu
#PBS -V
#
cd /N/gpfs/gustav/rts
#
# Execute this step
#
rts > rts.log
#
# If there is $RSAVE_CHECKFILE.old file then
# replace the suffix ".old" with a step number
#
if [ -n "${RSAVE_STEP}" ]
then
   RSAVE_STEP=`expr $RSAVE_STEP + 1`
   if [ -n "${RSAVE_CHECKFILE}" ]
   then
      if [ -f $RSAVE_CHECKFILE.old ]
      then
         mv $RSAVE_CHECKFILE.old $RSAVE_CHECKFILE.$RSAVE_STEP
      fi
   fi
else
   RSAVE_STEP=0
fi
#
# save the log of this run
#
cp rts.log rts.log.$RSAVE_STEP
#
# Check if the job is finished and if it is not
# resubmit this file
#
if grep CONTINUE rts.log
then
   ssh bh1 "cd /N/B/gustav/PBS; /usr/pbs/bin/qsub -v RSAVE_TIME_LIMIT=30,RSAVE_CHECKFILE=rts.dat,RSAVE_RESTART=yes,RSAVE_STEP=$RSAVE_STEP rts.sh"
elif grep FINISHED rts.log
then
   ssh bh1 mail -s finished gustav@indiana.edu << EOF
Your job rts has FINISHED
EOF
else
   ssh bh1 mail -s error gustav@indiana.edu << EOF
rts: error exit, check the log file
EOF
fi
exit 0
[gustav@bh1 PBS]$
And here is how this script is submitted and what happens afterwards:
[gustav@bh1 PBS]$ env | grep RSAVE
RSAVE_TIME_LIMIT=30
RSAVE_CHECKFILE=rts.dat
[gustav@bh1 PBS]$ qsub rts.sh
17038.bh1.avidd.iu.edu
[gustav@bh1 PBS]$
Observe that only RSAVE_TIME_LIMIT and RSAVE_CHECKFILE
have been defined. All other variables will be defined by the PBS
script as they become needed.
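The initial environment can be prepared along the following lines. This is a sketch only, assuming a bash login shell; the unset line is my addition, a precaution against values left over from an earlier session, since the #PBS -V directive in the script exports the whole submission environment to the job.

```shell
#!/bin/bash
# Define only the two variables that the first run needs.
export RSAVE_TIME_LIMIT=30
export RSAVE_CHECKFILE=rts.dat
# Precaution (an assumption, not in the original session): make sure
# no stale step or restart marker leaks into the initializing run.
unset RSAVE_STEP RSAVE_RESTART
# Show what a job submitted from this shell would inherit.
env | grep '^RSAVE' | sort
```

After this, qsub rts.sh submits the initializing run.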
The job runs happily resubmitting itself every time the program
rts exits and producing numerous log and data files. In this
case I have separated the PBS output, which is written on my
job submission directory /N/B/gustav/PBS and the rts
output, which is written on my GPFS directory /N/gpfs/gustav/rts.
Eventually I get a message that looks as follows:
Envelope-to: gustav@woodlands.tqc.iu.edu
Delivery-date: Fri, 19 Sep 2003 17:40:03 -0500
Date: Fri, 19 Sep 2003 17:39:26 -0500
From: Zdzislaw Meglicki <gustav@bh1.uits.indiana.edu>
To: gustav@indiana.edu
Subject: finished

Your job rts has FINISHED
and then I can go to my PBS and GPFS directories and look at the logs.
I find the following PBS output and error files in /N/B/gustav/PBS:
[gustav@bh1 PBS]$ pwd
/N/B/gustav/PBS
[gustav@bh1 PBS]$ ls rts.sh.*
rts.sh.e17040  rts.sh.e17045  rts.sh.e17050  rts.sh.o17044  rts.sh.o17049
rts.sh.e17041  rts.sh.e17046  rts.sh.o17040  rts.sh.o17045  rts.sh.o17050
rts.sh.e17042  rts.sh.e17047  rts.sh.o17041  rts.sh.o17046
rts.sh.e17043  rts.sh.e17048  rts.sh.o17042  rts.sh.o17047
rts.sh.e17044  rts.sh.e17049  rts.sh.o17043  rts.sh.o17048
[gustav@bh1 PBS]$
Files rts.sh.o* show the job resubmission history:
[gustav@bh1 PBS]$ cat rts.sh.o*
CONTINUE
17041.bh1.avidd.iu.edu
CONTINUE
17042.bh1.avidd.iu.edu
CONTINUE
17043.bh1.avidd.iu.edu
CONTINUE
17044.bh1.avidd.iu.edu
CONTINUE
17045.bh1.avidd.iu.edu
CONTINUE
17046.bh1.avidd.iu.edu
CONTINUE
17047.bh1.avidd.iu.edu
CONTINUE
17048.bh1.avidd.iu.edu
CONTINUE
17049.bh1.avidd.iu.edu
CONTINUE
17050.bh1.avidd.iu.edu
FINISHED
[gustav@bh1 PBS]$
and files rts.sh.e* are, mercifully, empty.
Switching now to /N/gpfs/gustav/rts shows:
[gustav@bh1 PBS]$ cd /N/gpfs/gustav/rts
[gustav@bh1 rts]$ ls
rts.dat     rts.dat.3  rts.dat.7  rts.log.0   rts.log.3  rts.log.7
rts.dat.1   rts.dat.4  rts.dat.8  rts.log.1   rts.log.4  rts.log.8
rts.dat.10  rts.dat.5  rts.dat.9  rts.log.10  rts.log.5  rts.log.9
rts.dat.2   rts.dat.6  rts.log    rts.log.2   rts.log.6
[gustav@bh1 rts]$
Files rts.dat.* contain the animation of the computation process:
[gustav@bh1 rts]$ cat `ls -t rts.dat.*`
30
27
24
21
18
15
12
9
6
3
[gustav@bh1 rts]$
If these files contained images, I could make them into a movie.
Files rts.log.* contain the detailed history of the whole
computation:
[gustav@bh1 rts]$ cat `ls -t rts.log.*`
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 30
computing ...
n = 31, time left = 25 seconds
done.
n = 31
renaming old restart file to rts.dat.old
saving data on rts.dat
FINISHED
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 27
computing ...
n = 28, time left = 25 seconds
n = 29, time left = 20 seconds
n = 30, time left = 15 seconds
Run out of time, exiting ...
done.
n = 30
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
...
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 3
computing ...
n = 4, time left = 25 seconds
n = 5, time left = 20 seconds
n = 6, time left = 15 seconds
Run out of time, exiting ...
done.
n = 6
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
Time for this job limited to 30 seconds.
Starting a new run.
n = 0
computing ...
n = 1, time left = 25 seconds
n = 2, time left = 20 seconds
n = 3, time left = 15 seconds
Run out of time, exiting ...
done.
n = 3
saving data on rts.dat
CONTINUE
[gustav@bh1 rts]$
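Because ls -t sorts by modification time, newest first, the listing above runs backwards, from the final run to the initializing one, and it depends on the time stamps being intact. A plain ls is no better, since it places rts.log.10 before rts.log.2. The following is a minimal sketch of a listing ordered by the step number itself; list_by_step is a hypothetical helper, and it assumes that the file names contain no newlines.

```shell
#!/bin/bash
# list_by_step is a hypothetical helper: it lists PREFIX.N files in
# ascending numeric order of N, independent of modification times.
list_by_step() {
    local prefix=$1 f
    for f in "$prefix".[0-9]*; do
        [ -e "$f" ] || continue        # nothing matched: skip the literal pattern
        echo "${f##*.} $f"             # emit "N filename"
    done | sort -n | cut -d' ' -f2-    # sort on N, then drop the N column
}
```

With this, cat $(list_by_step rts.log) would print the history in chronological order, beginning with the initializing run.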
I have requested in the PBS script, with the directive #PBS -m a,
that PBS should send mail to me only when the job gets aborted.
This is because I don't want to be flooded by an excessive amount
of mail. The PBS directive #PBS -m e would
send me e-mail every time the script exits, not just when the whole
computation is done.