
Combining the Application with LoadLeveler: Automatic Resubmission

In this section I shall demonstrate how our toy application can be run under LoadLeveler, and how you can use LoadLeveler's features to keep resubmitting the job automatically until the whole computational task is finished.

What makes it particularly easy is LoadLeveler's

#@environment=COPY_ALL
statement, which transfers all currently defined environment variables to the submitted job. That way we can define, say, RSAVE_RESTART in the script after the first, initialising run of the application, and rest assured that when the job is resubmitted, it will read its data from the restart file.
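
The effect of COPY_ALL resembles ordinary environment inheritance between a shell and its child processes. The following sketch does not involve LoadLeveler at all; it only illustrates, in plain bash, that exported variables reach a child process while unexported ones do not, which is why the script has to export whatever the next run should see. The variable RSAVE_DEMO_LOCAL is a made-up name used for contrast.

```shell
#!/bin/bash
# Illustration only: plain bash, no LoadLeveler involved.
export RSAVE_RESTART=yes      # exported: visible to child processes
RSAVE_DEMO_LOCAL=hidden       # hypothetical name; not exported, stays local

# A child shell stands in for "the next instantiation of the job".
child_view=$(bash -c 'echo "restart=${RSAVE_RESTART:-unset} local=${RSAVE_DEMO_LOCAL:-unset}"')
echo "$child_view"
```

The child reports RSAVE_RESTART as set and RSAVE_DEMO_LOCAL as unset.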

The LoadLeveler script begins by running the program ./rts, which is our application. Its output is saved in rts.log:

./rts > rts.log

Both C and Fortran examples are invoked in the same way.

After the program exits, the script performs a number of quite interesting manipulations. First of all, it checks whether the environment variable RSAVE_STEP exists. That variable is used to number our LoadLeveler runs. It is quite like LoadLeveler's variable $(stepid), with the difference that here we do it all ourselves. If the variable exists, it means that this particular run was already a resubmission. In that case the value of RSAVE_STEP is incremented and the old restart file, say, rts.dat.old, is renamed to something like rts.dat.3, where 3 is the RSAVE_STEP number. That way we keep a log of the whole computation. In a more complex application, the rts.dat files could contain images or three-dimensional data sets, which, if saved, could be used to produce an animation or a CAVE display.

If the variable RSAVE_STEP does not exist, it means that this is the initialising run. In that case the variable is created and assigned the value 0. Because we export it, it will become available to the next instantiation of the job.
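
The step numbering just described can be sketched as two small shell functions. This is only an illustration of the logic, not the script used in this section; the function names next_step and archive_name are made up here.

```shell
#!/bin/bash
# next_step: print the step number that follows the one passed in.
# An empty argument means "initialising run" and yields 0.
next_step() {
    local current="$1"
    if [ -n "$current" ]; then
        echo $(( current + 1 ))
    else
        echo 0
    fi
}

# archive_name: name under which the old restart file is kept,
# e.g. rts.dat and step 3 give rts.dat.3.
archive_name() {
    echo "$1.$2"
}

next_step ""               # RSAVE_STEP not yet defined
next_step 2                # a run that inherited RSAVE_STEP=2
archive_name rts.dat 3
```

In the real script the result is exported, so that the next instantiation of the job inherits it.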

The log file, rts.log, is also saved under a name such as rts.log.3, where 3 is the RSAVE_STEP number. Observe that rts.log.3 corresponds to the run that used rts.dat.3 as its restart file.

After these manipulations we inspect the log file itself and check if it contains the word CONTINUE. If it does, we check if the variable RSAVE_RESTART exists. If it doesn't, it means that this was the first, initialising run. So we create that variable. Once created it will become available to the next instantiation of the job via the #@environment=COPY_ALL mechanism. Either way the job is resubmitted with the command

llsubmit $LOADL_STEP_COMMAND
where $LOADL_STEP_COMMAND evaluates to the name of the LoadLeveler script itself.

If the word CONTINUE has not been found in the log file, then we check if the log file contains the word FINISHED. If the job is FINISHED it is not resubmitted. Instead a mail message is sent to whoever submitted the job in the first place ($LOADL_STEP_OWNER), informing the addressee that the job has been completed.

If neither the word CONTINUE nor the word FINISHED has been found in the log file, it means that an error condition must have occurred and the job exited mid-way. In that case, the job is not resubmitted and a mail message reporting the error is sent to the $LOADL_STEP_OWNER.
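
The three-way decision described above can be isolated in a small function, which is handy for trying the logic on sample log files without submitting anything. This is a sketch under the assumption that the words CONTINUE and FINISHED appear in the log exactly as shown; the helper name log_status is hypothetical. Unlike the plain grep used in the LoadLeveler script, grep -q prints nothing, so the matching lines would not show up in the job's output file.

```shell
#!/bin/bash
# log_status: print "continue", "finished" or "error" for a given log file.
# Hypothetical helper, not part of the LoadLeveler script itself.
log_status() {
    local log="$1"
    if grep -q CONTINUE "$log"; then
        echo continue            # the job should be resubmitted
    elif grep -q FINISHED "$log"; then
        echo finished            # the job is complete: notify the owner
    else
        echo error               # neither word found: report an error
    fi
}

# Try it on three small sample logs in a scratch directory.
tmpdir=$(mktemp -d)
printf ' n = 3\n CONTINUE\n'  > "$tmpdir/a.log"
printf ' n = 31\n FINISHED\n' > "$tmpdir/b.log"
printf ' n = 5\n'             > "$tmpdir/c.log"
for f in a b c; do
    log_status "$tmpdir/$f.log"
done
rm -rf "$tmpdir"
```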

Here is the whole LoadLeveler script in full glory:

# @ shell = /afs/ovpit.indiana.edu/@sys/gnu/bin/bash
# @ environment = COPY_ALL
# @ job_name = rts
# @ output = $(job_name).$(jobid).out
# @ error = $(job_name).$(jobid).err
# @ class = test
# @ notification = never
# @ queue
#
# Execute this step.
#
./rts > rts.log
#
# If there is $RSAVE_CHECKFILE.old file then
# replace the suffix ".old" with a step number.
#
if [ -n "${RSAVE_STEP}" ]
then
   export RSAVE_STEP=`expr $RSAVE_STEP + 1`
   if [ -n "${RSAVE_CHECKFILE}" ]
   then
      if [ -f "$RSAVE_CHECKFILE.old" ]
      then
         mv "$RSAVE_CHECKFILE.old" "$RSAVE_CHECKFILE.$RSAVE_STEP"
      fi
   fi
else
   export RSAVE_STEP=0
fi
# 
# also save the log of this run 
#
cp rts.log rts.log.$RSAVE_STEP
#
# Check if the job is finished and if it is not
# resubmit this file
#
if grep CONTINUE rts.log
then
   if [ -z "${RSAVE_RESTART}" ]
   then
      export RSAVE_RESTART=yes
   fi
   llsubmit $LOADL_STEP_COMMAND
elif grep FINISHED rts.log
then
   mailx $LOADL_STEP_OWNER << EOF
Your job rts has FINISHED
EOF
else
   mailx $LOADL_STEP_OWNER << EOF
rts: error exit, check the log file
EOF
fi

Here is how this script is submitted and what happens afterwards.

gustav@sp19:../LoadLeveler 15:09:24 !620 $ env | grep RSAVE
RSAVE_TIME_LIMIT=30
RSAVE_CHECKFILE=rts.dat
gustav@sp19:../LoadLeveler 15:09:35 !621 $ llsubmit rts.ll
submit: The job "sp19.104" has been submitted.
gustav@sp19:../LoadLeveler 15:09:40 !622 $
Observe that only RSAVE_TIME_LIMIT and RSAVE_CHECKFILE have been defined. All other variables will be defined by the LoadLeveler script as they become needed.

The job runs happily, resubmitting itself every time the program rts exits and producing numerous log and data files:

gustav@sp19:../LoadLeveler 15:27:09 !695 $ ls rts*
rts          rts.169.err  rts.448.out  rts.dat.10   rts.f        rts.log.5
rts.105.err  rts.169.out  rts.449.err  rts.dat.2    rts.ll       rts.log.6
rts.105.out  rts.445.err  rts.449.out  rts.dat.3    rts.log      rts.log.7
rts.166.err  rts.445.out  rts.98.err   rts.dat.4    rts.log.0    rts.log.8
rts.166.out  rts.446.err  rts.98.out   rts.dat.5    rts.log.1    rts.log.9
rts.167.err  rts.446.out  rts.c        rts.dat.6    rts.log.10
rts.167.out  rts.447.err  rts.cpp      rts.dat.7    rts.log.2
rts.168.err  rts.447.out  rts.dat      rts.dat.8    rts.log.3
rts.168.out  rts.448.err  rts.dat.1    rts.dat.9    rts.log.4
gustav@sp19:../LoadLeveler 15:30:13 !696 $
The rts.dat.* files contain the evolution (or animation) of the system:
gustav@sp19:../LoadLeveler 15:30:13 !696 $ cat `ls -t rts.dat.*`
     30
     27
     24
     21
     18
     15
     12
      9
      6
      3
gustav@sp19:../LoadLeveler 15:31:22 !697 $
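
Note that ls -t orders files by modification time, newest first, so the listing above runs backwards through the computation and relies on the time stamps being intact. A sketch of an alternative that orders the files by their numeric suffix instead, assuming file names of the form rts.dat.N as produced by the script; the helper name list_by_step is made up here.

```shell
#!/bin/bash
# list_by_step: print files named PREFIX.N in increasing order of N.
# Hypothetical helper; files with non-numeric suffixes are ignored.
list_by_step() {
    local prefix="$1" f
    for f in "$prefix".*; do
        [ -e "$f" ] || continue          # unmatched glob: nothing to list
        case "${f##*.}" in
            ''|*[!0-9]*) continue ;;     # skip suffixes such as .old
        esac
        printf '%s %s\n' "${f##*.}" "$f" # "N filename", so we can sort on N
    done | sort -n | cut -d' ' -f2-
}
```

With this helper, cat $(list_by_step rts.dat) would print the saved states from the first step to the last.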
The rts.log.* files contain the log of the whole computation:
gustav@sp19:../LoadLeveler 15:31:22 !697 $ cat `ls -t rts.log.*`
 Time for this job limited to      30 seconds
 Restarting the job from rts.dat                                                         
 n =      30
         computing ... 
                n =      31 time left =      25 seconds
         done.
 n =      31
 Renaming the old restart file to rts.dat.old                                                     
 Saving data on rts.dat                                                         
 FINISHED
 Time for this job limited to      30 seconds
 Restarting the job from rts.dat                                                         
 n =      27
         computing ... 
                n =      28 time left =      25 seconds
                n =      29 time left =      20 seconds
                n =      30 time left =      15 seconds
                Run out of time, exiting ... 
         done.
 n =      30
 Renaming the old restart file to rts.dat.old                                                     
 Saving data on rts.dat                                                         
 CONTINUE

...


 Time for this job limited to      30 seconds
 Restarting the job from rts.dat                                                         
 n =       3
         computing ... 
                n =       4 time left =      25 seconds
                n =       5 time left =      20 seconds
                n =       6 time left =      15 seconds
                Run out of time, exiting ... 
         done.
 n =       6
 Renaming the old restart file to rts.dat.old                                                     
 Saving data on rts.dat                                                         
 CONTINUE
 Time for this job limited to      30 seconds
 Starting a new run
 n =       0
         computing ... 
                n =       1 time left =      25 seconds
                n =       2 time left =      20 seconds
                n =       3 time left =      15 seconds
                Run out of time, exiting ... 
         done.
 n =       3
 Saving data on rts.dat                                                         
 CONTINUE
gustav@sp19:../LoadLeveler 15:32:16 !698 $

And the rts.*.out files contain messages from the LoadLeveler script in its various instantiations:

gustav@sp19:../LoadLeveler 15:32:16 !698 $ cat `ls -t rts.*.out`
 FINISHED
 CONTINUE
submit: The job "sp18.169" has been submitted.
 CONTINUE
submit: The job "sp17.449" has been submitted.
 CONTINUE
submit: The job "sp18.168" has been submitted.
 CONTINUE
submit: The job "sp17.448" has been submitted.
 CONTINUE
submit: The job "sp18.167" has been submitted.
 CONTINUE
submit: The job "sp17.447" has been submitted.
 CONTINUE
submit: The job "sp18.166" has been submitted.
 CONTINUE
submit: The job "sp17.446" has been submitted.
 CONTINUE
submit: The job "sp21.98" has been submitted.
 CONTINUE
submit: The job "sp17.445" has been submitted.
gustav@sp19:../LoadLeveler 15:34:36 !699 $

When the whole job finished, I received the following mail message, sent to me by the LoadLeveler script:

Date: Tue, 26 Jan 1999 15:26:56 -0500
From: Zdzislaw Meglicki <gustav@sp17.ucs.indiana.edu>
Message-Id: <199901262026.PAA18102@sp17.ucs.indiana.edu>
To: gustav@sp17.ucs.indiana.edu
Content-Type: text
Content-Length: 26

Your job rts has FINISHED

If you run a long job that resubmits itself twice or perhaps only once a day, it is a good idea to change

#@notification = never
to
#@notification = always
so that you can keep an eye on the computation.


Zdzislaw Meglicki
2001-02-26