
Combining the Application with PBS: Automatic Resubmission

In this section I shall demonstrate how our toy application can be run under PBS, and how you can use PBS features to keep resubmitting the job automatically until the whole computational task is finished.

The PBS script begins by changing to the GPFS directory where we run rts and then running the program itself. The output is saved in rts.log:

cd /N/gpfs/gustav/rts
./rts > rts.log
After the program exits, the script performs a number of manipulations. First of all, it checks whether the environment variable RSAVE_STEP exists. That variable is used to number our PBS runs. If the variable exists, this particular run was already a resubmission. In that case the value of RSAVE_STEP is incremented and the old restart file, say, rts.dat.old, is renamed to something like rts.dat.3, where 3 is the RSAVE_STEP number. This way we keep a record of the whole computation. In a more complex application the rts.dat files could contain images or three-dimensional data sets that, if saved, could be used to produce an animation or a CAVE display.

If the variable RSAVE_STEP does not exist, this is the initializing run. In that case the variable is created and set to 0. We make it available to the next run by means of the qsub -v option.
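
The numbering rule just described can be sketched as a small shell function. This is only an illustration, not the code used in rts.sh: the function name next_step is hypothetical, and it uses shell arithmetic, $(( )), where the script itself uses expr.

```shell
# Print the step number for the current run.
# An empty argument means that this is the initializing run, so counting starts at 0.
# (next_step is a hypothetical helper, not part of rts.sh.)
next_step() {
    local previous=${1:-}
    if [ -n "$previous" ]; then
        echo $(( previous + 1 ))
    else
        echo 0
    fi
}

# Usage in a job script: update the counter and export it for child processes.
RSAVE_STEP=$(next_step "${RSAVE_STEP:-}")
export RSAVE_STEP
echo "this is run number $RSAVE_STEP"
```

Keeping the rule in one function makes it easy to check by hand: an empty argument yields 0 and any number yields its successor.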

The log file, rts.log, is also copied to a file such as rts.log.3, where 3 is the RSAVE_STEP number. Observe that rts.log.3 corresponds to the run that used rts.dat.3 as its restart file.
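
The bookkeeping for one run might be written as follows. The function name archive_run and its argument order are assumptions made for this illustration. Unlike rts.sh, the sketch renames a leftover .old file on any run, not only on a resubmission.

```shell
# Keep a numbered copy of the log and give the superseded restart file
# the same number, so that the two can be matched later.
# Arguments: step number, restart file name, log file name.
# (archive_run is a hypothetical helper, not part of rts.sh.)
archive_run() {
    local step=$1 checkfile=$2 logfile=$3
    if [ -f "$checkfile.old" ]; then
        mv "$checkfile.old" "$checkfile.$step"
    fi
    cp "$logfile" "$logfile.$step"
}

# Demonstration in a scratch directory with made-up file contents.
scratch=$(mktemp -d)
echo 3        > "$scratch/rts.dat.old"
echo CONTINUE > "$scratch/rts.log"
archive_run 1 "$scratch/rts.dat" "$scratch/rts.log"
ls "$scratch"
rm -r "$scratch"
```

The log is copied rather than moved, so rts.log always holds the output of the most recent run.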

After these manipulations we inspect the log file itself and check whether it contains the word CONTINUE. If it does, we resubmit the next iteration of the job with the command:

ssh bh1 "cd /N/B/gustav/PBS; /usr/pbs/bin/qsub \
-v RSAVE_TIME_LIMIT=30,RSAVE_CHECKFILE=rts.dat,RSAVE_RESTART=yes,\
RSAVE_STEP=$RSAVE_STEP rts.sh"
We pass RSAVE_RESTART=yes to the next job on the qsub command line. This way the next job will know that it is a resubmission job, not an initialization job. We also pass RSAVE_STEP to the next job, so that it can number the log and data files properly.
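
The comma-separated list handed to qsub -v can be assembled in one place, which makes it harder to lose a variable when the command line is edited. The following is a sketch under stated assumptions: the function name rsave_vars is hypothetical, and the fallback values 30 and rts.dat are simply the ones used in this section. The sketch prints the qsub command rather than running it, so it works on a machine without PBS.

```shell
# Build the variable list for "qsub -v" from the current environment.
# Argument: the step number to hand to the next job.
# (rsave_vars is a hypothetical helper; the defaults mirror the values used in the text.)
rsave_vars() {
    local step=$1
    printf 'RSAVE_TIME_LIMIT=%s,RSAVE_CHECKFILE=%s,RSAVE_RESTART=yes,RSAVE_STEP=%s\n' \
        "${RSAVE_TIME_LIMIT:-30}" "${RSAVE_CHECKFILE:-rts.dat}" "$step"
}

# Show the command that would be issued for step 3.
echo qsub -v "$(rsave_vars 3)" rts.sh
```

Forwarding the current values of RSAVE_TIME_LIMIT and RSAVE_CHECKFILE, instead of repeating literal values, means that a change made at the first submission carries through every resubmission.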

If the word CONTINUE has not been found in the log file, then we check whether the log file contains the word FINISHED. If the job is FINISHED, it is not resubmitted. Instead, a mail message is sent, to me in this case, announcing the completion of the computation. Observe that the mail message is sent from the head node, bh1, via ssh.

If neither the word CONTINUE nor the word FINISHED has been found in the log file, an error condition must have occurred and the program exited midway. In that case the job is not resubmitted, and a mail message reporting the error is sent to me.
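
The three-way decision can be isolated in a function that only reports what should happen next. This is a minimal sketch: the name log_status and the words it prints are assumptions made for illustration. It uses grep -q, which tests for a match without printing the matching line.

```shell
# Decide what to do next from the contents of a log file.
# Prints one of: resubmit, finished, error.
# (log_status is a hypothetical helper, not part of rts.sh.)
log_status() {
    local logfile=$1
    if grep -q CONTINUE "$logfile"; then
        echo resubmit
    elif grep -q FINISHED "$logfile"; then
        echo finished
    else
        echo error
    fi
}

# Demonstration with a made-up log file.
log=$(mktemp)
echo CONTINUE > "$log"
log_status "$log"     # prints: resubmit
rm "$log"
```

A caller could then branch on the printed word with a case statement, keeping the qsub and mail commands separate from the test itself.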

Here is the whole PBS script in its full glory:

[gustav@bh1 PBS]$ cat rts.sh
#PBS -S /bin/bash
#PBS -q bg
#PBS -m a
#PBS -M gustav@indiana.edu
#PBS -V
#
cd /N/gpfs/gustav/rts
#
# Execute this step
#
rts > rts.log
#
# If there is $RSAVE_CHECKFILE.old file then
# replace the suffix ".old" with a step number
#
if [ -n "${RSAVE_STEP}" ]
then
   RSAVE_STEP=`expr $RSAVE_STEP + 1`
   if [ -n "${RSAVE_CHECKFILE}" ]
   then
      if [ -f $RSAVE_CHECKFILE.old ]
      then
         mv $RSAVE_CHECKFILE.old $RSAVE_CHECKFILE.$RSAVE_STEP
      fi
   fi
else
   RSAVE_STEP=0
fi
#
# save the log of this run
#
cp rts.log rts.log.$RSAVE_STEP
#
# Check if the job is finished and if it is not
# resubmit this file
#
if grep CONTINUE rts.log
then
   ssh bh1 "cd /N/B/gustav/PBS; /usr/pbs/bin/qsub -v RSAVE_TIME_LIMIT=30,RSAVE_CHECKFILE=rts.dat,RSAVE_RESTART=yes,RSAVE_STEP=$RSAVE_STEP rts.sh"
elif grep FINISHED rts.log
then
   ssh bh1 mail -s finished gustav@indiana.edu << EOF
Your job rts has FINISHED
EOF
else
   ssh bh1 mail -s error gustav@indiana.edu << EOF
rts: error exit, check the log file
EOF
fi
exit 0
[gustav@bh1 PBS]$
And here is how this script is submitted and what happens afterwards.
[gustav@bh1 PBS]$ env | grep RSAVE
RSAVE_TIME_LIMIT=30
RSAVE_CHECKFILE=rts.dat
[gustav@bh1 PBS]$ qsub rts.sh
17038.bh1.avidd.iu.edu
[gustav@bh1 PBS]$

Observe that only RSAVE_TIME_LIMIT and RSAVE_CHECKFILE have been defined. All other variables will be defined by the PBS script as they become needed.
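
Since the script carries #PBS -V, the job inherits the submission environment, so it can be worth confirming that the two variables really are exported before calling qsub. The check below is a sketch only: require_exported is a hypothetical helper, and it inspects the output of env in the same way as the env | grep RSAVE command above.

```shell
# Succeed only if every named variable is exported with a non-empty value.
# (require_exported is a hypothetical helper, not part of the original setup.)
require_exported() {
    local name
    for name in "$@"; do
        if ! env | grep -q "^$name=."; then
            echo "missing environment variable: $name" >&2
            return 1
        fi
    done
}

if require_exported RSAVE_TIME_LIMIT RSAVE_CHECKFILE; then
    echo "environment is ready for: qsub rts.sh"
else
    echo "export the variables listed above before submitting"
fi
```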

The job runs happily, resubmitting itself every time the program rts exits and producing numerous log and data files. In this case I have separated the PBS output, which is written to my job submission directory /N/B/gustav/PBS, from the rts output, which is written to my GPFS directory /N/gpfs/gustav/rts.

Eventually I get a message that looks as follows:

Envelope-to: gustav@woodlands.tqc.iu.edu
Delivery-date: Fri, 19 Sep 2003 17:40:03 -0500
Date: Fri, 19 Sep 2003 17:39:26 -0500
From: Zdzislaw Meglicki <gustav@bh1.uits.indiana.edu>
To: gustav@indiana.edu
Subject: finished

Your job rts has FINISHED
and then I can go to my PBS and GPFS directories and look at the logs.

I find the following PBS output and error files in /N/B/gustav/PBS:

[gustav@bh1 PBS]$ pwd
/N/B/gustav/PBS
[gustav@bh1 PBS]$ ls rts.sh.*
rts.sh.e17040  rts.sh.e17045  rts.sh.e17050  rts.sh.o17044  rts.sh.o17049
rts.sh.e17041  rts.sh.e17046  rts.sh.o17040  rts.sh.o17045  rts.sh.o17050
rts.sh.e17042  rts.sh.e17047  rts.sh.o17041  rts.sh.o17046
rts.sh.e17043  rts.sh.e17048  rts.sh.o17042  rts.sh.o17047
rts.sh.e17044  rts.sh.e17049  rts.sh.o17043  rts.sh.o17048
[gustav@bh1 PBS]$
Files rts.sh.o* show the job resubmission history:
[gustav@bh1 PBS]$ cat rts.sh.o*
CONTINUE
17041.bh1.avidd.iu.edu
CONTINUE
17042.bh1.avidd.iu.edu
CONTINUE
17043.bh1.avidd.iu.edu
CONTINUE
17044.bh1.avidd.iu.edu
CONTINUE
17045.bh1.avidd.iu.edu
CONTINUE
17046.bh1.avidd.iu.edu
CONTINUE
17047.bh1.avidd.iu.edu
CONTINUE
17048.bh1.avidd.iu.edu
CONTINUE
17049.bh1.avidd.iu.edu
CONTINUE
17050.bh1.avidd.iu.edu
FINISHED
[gustav@bh1 PBS]$
and files rts.sh.e* are, mercifully, empty.

Switching now to /N/gpfs/gustav/rts shows:

[gustav@bh1 PBS]$ cd /N/gpfs/gustav/rts
[gustav@bh1 rts]$ ls
rts.dat     rts.dat.3  rts.dat.7  rts.log.0   rts.log.3  rts.log.7
rts.dat.1   rts.dat.4  rts.dat.8  rts.log.1   rts.log.4  rts.log.8
rts.dat.10  rts.dat.5  rts.dat.9  rts.log.10  rts.log.5  rts.log.9
rts.dat.2   rts.dat.6  rts.log    rts.log.2   rts.log.6
[gustav@bh1 rts]$
Files rts.dat.* contain the animation of the computation process:
[gustav@bh1 rts]$ cat `ls -t rts.dat.*`
30
27
24
21
18
15
12
9
6
3
[gustav@bh1 rts]$
If these files contained images, I could make them into a movie. Files rts.log.* contain the detailed history of the whole computation:
[gustav@bh1 rts]$ cat `ls -t rts.log.*`
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 30
        computing ... 
                n = 31, time left = 25 seconds
        done.
n = 31
renaming old restart file to rts.dat.old
saving data on rts.dat
FINISHED
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 27
        computing ... 
                n = 28, time left = 25 seconds
                n = 29, time left = 20 seconds
                n = 30, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 30
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
...
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 3
        computing ... 
                n = 4, time left = 25 seconds
                n = 5, time left = 20 seconds
                n = 6, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 6
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
Time for this job limited to 30 seconds.
Starting a new run.
n = 0
        computing ... 
                n = 1, time left = 25 seconds
                n = 2, time left = 20 seconds
                n = 3, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 3
saving data on rts.dat
CONTINUE
[gustav@bh1 rts]$
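
The listings above rely on ls -t, that is, on file modification times, whereas a plain alphabetical listing puts rts.dat.10 before rts.dat.2. If the time stamps are ever disturbed, for example by copying the files elsewhere, the step suffix can be used for ordering instead. Here is a minimal sketch, with the hypothetical function name list_by_step, that lists the files lowest step first:

```shell
# List files named PREFIX.N in ascending order of the step number N.
# Files whose suffix is not a plain number, such as PREFIX.old, are skipped.
# (list_by_step is a hypothetical helper, not part of the original setup.)
list_by_step() {
    local prefix=$1 f
    for f in "$prefix".*; do
        case ${f##*.} in
            ''|*[!0-9]*) continue ;;
        esac
        # Emit "number<TAB>name" so that sort can order on the number.
        printf '%s\t%s\n' "${f##*.}" "$f"
    done | sort -n | cut -f2-
}

# Demonstration in a scratch directory with empty, made-up files.
scratch=$(mktemp -d)
touch "$scratch/rts.dat" "$scratch/rts.dat.1" "$scratch/rts.dat.2" "$scratch/rts.dat.10"
( cd "$scratch" && list_by_step rts.dat )
rm -r "$scratch"
```

Replacing sort -n with sort -rn gives the newest-first order shown in the listings above.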

I have requested in the PBS script, with the directive #PBS -m a, that PBS should send mail to me only when the job gets aborted. This is because I don't want to be flooded by an excessive amount of mail. The PBS directive #PBS -m e would send me e-mail every time the script exits, not just when the whole computation is done.


Zdzislaw Meglicki
2004-04-29