In this section we will discuss how to time, checkpoint, and
automatically resubmit LoadLeveler jobs. LoadLeveler has a special
mechanism for checkpointing jobs, but it requires linking your program with LoadLeveler
libraries, and it works neither for parallel jobs nor under other batch queueing systems, such as NQS.
Rather than relying on it, here I demonstrate how you can easily implement your own timing,
checkpointing, and job resubmission mechanism in C, Fortran 90, and Common Lisp.
The procedures discussed in this section are not limited to LoadLeveler. They should work for any batch
submission system, as long as the batch jobs are described in terms of shell scripts, and as long as the system in
question is IEEE-1003 (POSIX) compliant. They are applicable both to sequential and parallel jobs.
There are four issues that need to be addressed when automatically checkpointing and resubmitting your
LoadLeveler jobs.
- Timing the job: Your job must know how much CPU or wall-clock time it has used so far, and how much time
is left.
- Saving the state of the job: This usually involves dumping a data file that contains an essential summary of the state of the
system that is being computed. That file will be read when the job is restarted, and computation will
resume from the point reached when the file was dumped.
- Informing the parent process (usually a shell) that the computation
should be continued: This can be done, for example, by exiting the job with a non-zero exit status. Alternatively, you
could write a specific message to a log file (to be searched for by the shell script when the job exits)
or create an empty flag file.
- Resubmitting the job: Depending on whether the job should be continued, the LoadLeveler script, before exiting, should
either
  1. resubmit itself, possibly with certain new flags or variables set up, or
  2. clean up and inform the user that the computation has been completed.
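The first three issues can be sketched in a few lines of C. This is only a minimal illustration under stated assumptions, not the implementation developed in the following sections: the names `job_state`, `seconds_left`, `save_state`, `load_state`, `run_job`, and the exit status `EXIT_CONTINUE` (2) are all hypothetical, the "computation" is a trivial counter, and the timing uses wall-clock time from `time()` only (CPU time could be measured with `clock()` or `times()` instead).

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define EXIT_DONE     0  /* computation finished */
#define EXIT_CONTINUE 2  /* assumed convention: state saved, resubmit me */

/* Hypothetical job state: a stand-in for the real simulation data. */
typedef struct {
    long   step;   /* number of completed iterations */
    double value;  /* result accumulated so far */
} job_state;

/* Timing: wall-clock seconds left out of a limit of `limit` seconds. */
static double seconds_left(time_t start, double limit)
{
    return limit - difftime(time(NULL), start);
}

/* Saving the state: write to a temporary file, then rename it, so that
   an interrupted dump does not clobber the previous checkpoint
   (POSIX rename() replaces an existing file). */
static int save_state(const char *path, const job_state *s)
{
    char tmp[1024];
    FILE *f;
    int ok;

    if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp)
        return -1;
    if ((f = fopen(tmp, "wb")) == NULL)
        return -1;
    ok = (fwrite(s, sizeof *s, 1, f) == 1);
    if (fclose(f) != 0)
        ok = 0;
    if (!ok) {
        remove(tmp);
        return -1;
    }
    return rename(tmp, path) == 0 ? 0 : -1;
}

/* Restarting: read the state back; returns -1 if there is no usable file. */
static int load_state(const char *path, job_state *s)
{
    FILE *f = fopen(path, "rb");
    int ok;

    if (f == NULL)
        return -1;
    ok = (fread(s, sizeof *s, 1, f) == 1);
    fclose(f);
    return ok ? 0 : -1;
}

/* Run until all steps are done, or until fewer than `margin` seconds
   remain. The return value is meant to be used as the exit status,
   which is how the parent shell learns whether to resubmit. */
static int run_job(const char *path, long total_steps,
                   double limit, double margin, job_state *s)
{
    time_t start = time(NULL);

    if (load_state(path, s) != 0) {  /* no checkpoint: start afresh */
        s->step = 0;
        s->value = 0.0;
    }
    while (s->step < total_steps) {
        if (seconds_left(start, limit) < margin)
            return save_state(path, s) == 0 ? EXIT_CONTINUE : EXIT_FAILURE;
        s->value += 1.0;             /* one unit of "work" */
        s->step++;
    }
    return EXIT_DONE;
}
```

A `main` function would simply return the value of `run_job`, called with a time limit somewhat shorter than the limit of the batch class. The fourth issue is then left to the job script: it would inspect the exit status of the program and, if it equals the agreed value (2 in this sketch), resubmit itself, for example with `llsubmit`; otherwise it would clean up and notify the user.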
Zdzislaw Meglicki
2001-02-26