Next: Timing a Job Up: Working with PBS Previous: PBS Dependency Lists

Checkpointing and Resubmission

In this section we will discuss how to time, checkpoint and automatically resubmit PBS jobs.

PBS can checkpoint jobs automatically under Cray Unicos, but not under Linux. Checkpointing is an intricate matter in any case: if the state of the program depends on external files as well as on its memory image, automatic checkpointing may not capture the full state of the program correctly. Additional complications arise when you attempt to checkpoint a parallel program. Consequently, it is usually a good idea to equip your program with its own checkpointing ability. This is not difficult and should be done for every serious project, especially if you are the developer.

The procedures discussed in this section are not limited to PBS. They should work for any batch submission system, as long as the batch jobs are described in terms of shell scripts, and as long as the system in question is IEEE-1003 (POSIX) compliant. They are applicable both to sequential and parallel jobs.

There are four issues that need to be addressed when automatically checkpointing and resubmitting your PBS jobs.

Timing the job
Your job must know how much CPU or wall-clock time it has used so far, and how much time it still has left.
Saving the state of the job
This usually involves dumping a data file that contains an essential summary of the state of the system being computed. That file is read when the job is restarted, and the computation resumes from the point it had reached when the file was dumped.
Informing the parent process (usually a shell) that the computation should be continued
This can be done, for example, by exiting the job with a non-zero exit status. Alternatively, you could write a specific message to a log file (to be searched for by the shell script when the job exits) or create an empty flag file.
Resubmitting the job
Depending on whether the job should be continued, the PBS script, before exiting, should either
resubmit itself, possibly with certain new flags or variables set up, or
clean up and inform the user that the computation has been completed.
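
The last two issues can be sketched in a few lines of POSIX shell. This is only an illustration under stated assumptions, not a prescribed method: the names CONTINUE_STATUS, next_action, finish_job, SUBMIT and JOB_SCRIPT are hypothetical, as is the convention that the program exits with status 2 after it has written its checkpoint file and wants to be continued. The only PBS command used is qsub, called with the path of the job script.

```shell
#!/bin/sh
# Sketch only: all names below are hypothetical, chosen for illustration.

# Assumed convention: the program exits with this status after it has
# dumped its checkpoint file and needs to be continued in a new job.
CONTINUE_STATUS=2

# Submission command and job script path. SUBMIT can be overridden
# (for example with "echo") to make a dry run without a batch system.
SUBMIT=${SUBMIT:-qsub}
JOB_SCRIPT=${JOB_SCRIPT:-job.sh}

# Map the exit status of the computation ($1) to an action name.
next_action() {
    case "$1" in
        0)                  echo finished ;;
        "$CONTINUE_STATUS") echo resubmit ;;
        *)                  echo failed ;;
    esac
}

# Act on the exit status of the computation ($1): resubmit the job
# script, report completion, or report a failure on standard error.
finish_job() {
    case "$(next_action "$1")" in
        resubmit) $SUBMIT "$JOB_SCRIPT" ;;
        finished) echo "computation completed" ;;
        *)        echo "computation failed with status $1" >&2
                  return 1 ;;
    esac
}

# Intended use at the end of a job script (my_program is a placeholder):
#   ./my_program
#   finish_job $?
```

Using a dedicated non-zero status for "continue" keeps that case distinct from a genuine failure, which would otherwise also produce a non-zero status and cause the job to be resubmitted indefinitely.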

Zdzislaw Meglicki