Next: Timing a Job Up: Working with LoadLeveler Previous: Submitting POE/HPF Jobs

Checkpointing and Resubmission

In this section we will discuss how to time, checkpoint, and automatically resubmit LoadLeveler jobs. LoadLeveler has a special mechanism for checkpointing jobs, but it requires linking your program with LoadLeveler libraries, it does not work for parallel jobs, and it is of no use under other batch queueing systems, such as NQS. Rather than relying on it, I demonstrate here how you can easily implement your own timing, checkpointing, and job resubmission mechanism in C, Fortran 90, and Common Lisp.

The procedures discussed in this section are not limited to LoadLeveler. They should work for any batch submission system, as long as the batch jobs are described in terms of shell scripts and the system in question is IEEE-1003 (POSIX) compliant. They are applicable to both sequential and parallel jobs.
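To give a flavour of what such a home-made mechanism may look like, here is a minimal sketch in C. It is only an illustration under stated assumptions, not the code developed in the following sections: the function names (time_nearly_up, save_checkpoint, load_checkpoint, run), the one-number checkpoint file, and the exit-status convention are all hypothetical. The sketch uses nothing beyond the standard C library, so it should not depend on any particular batch system.

```c
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: returns 1 when fewer than `margin` seconds remain
   of a wall-clock allowance of `limit` seconds that began at `start`. */
int time_nearly_up(time_t start, time_t now, double limit, double margin)
{
    return difftime(now, start) >= limit - margin;
}

/* Write the current step number to the checkpoint file.
   Returns 0 on success and -1 on failure. */
int save_checkpoint(const char *path, long step)
{
    FILE *f = fopen(path, "w");
    int ok;
    if (f == NULL)
        return -1;
    ok = fprintf(f, "%ld\n", step) > 0;
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read the step number back. A missing or unreadable checkpoint
   file is treated as a fresh start, that is, step 0. */
long load_checkpoint(const char *path)
{
    FILE *f = fopen(path, "r");
    long step = 0;
    if (f == NULL)
        return 0;
    if (fscanf(f, "%ld", &step) != 1)
        step = 0;
    fclose(f);
    return step;
}

/* Resume from the checkpoint, if any, and work until either the job
   is done or the time allowance is nearly used up.
   Assumed convention: 0 = finished, 1 = checkpointed (resubmit me),
   2 = the checkpoint could not be written. */
int run(const char *path, long total_steps, double limit, double margin)
{
    time_t start = time(NULL);
    long step = load_checkpoint(path);

    while (step < total_steps) {
        /* one unit of the real computation would go here */
        step++;
        if (step < total_steps
            && time_nearly_up(start, time(NULL), limit, margin))
            return save_checkpoint(path, step) == 0 ? 1 : 2;
    }
    remove(path);   /* finished: the checkpoint is no longer needed */
    return 0;
}
```

A main program would simply return the value of run as its exit status, so that the shell script describing the batch job can inspect that status and resubmit the job when it is 1. The time check compares against the allowance minus a safety margin, so that the checkpoint is written before the batch system terminates the job.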

There are four issues that need to be addressed when automatically checkpointing and resubmitting your LoadLeveler jobs.



 
Zdzislaw Meglicki
2001-02-26