Checkpointing and Resubmission
In this section we will discuss how to time, checkpoint and
automatically resubmit PBS jobs.
PBS can checkpoint jobs automatically under Cray Unicos, but not under Linux.
Checkpointing is an intricate matter in any case: if the state
of the program depends on external files as well as on its
memory image, then automatic checkpointing may not capture the full state
of the program correctly. Additional complications arise when
you attempt to checkpoint a parallel program. Consequently, it is
usually a good idea to equip your program with its own checkpointing
ability. This is not difficult and should be done for every
serious project, especially if you are the developer.
The procedures discussed in this section are not limited to
PBS. They should work for any batch submission system, as long
as the batch jobs are described in terms of shell scripts, and as long
as the system in question is IEEE-1003 (POSIX) compliant. They are
applicable to both sequential and parallel jobs.
There are four issues that need to be addressed when automatically
checkpointing and resubmitting your PBS jobs.
- Timing the job
- Your job must know how much CPU or wall-clock time
it has used so far, and how much time is left.
- Saving the state of the job
- This usually involves dumping
a data file that contains an essential summary of the state of
the system being computed. That file is read when
the job is restarted, and computation resumes from
the point reached when the file was dumped.
- Informing the parent process (usually a shell) that
the computation should be continued
- This can be done, for example,
by exiting the job with a non-zero exit status. Alternatively, you could
write a specific message to a log file (to be searched for by the shell
script when the job exits) or create an empty flag file.
- Resubmitting the job
- Depending on whether the job should be
continued, the PBS script, before exiting, should either
  1. resubmit itself, possibly with certain new flags or variables
     set up, or
  2. clean up and inform the user that the computation has been completed.
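The last three issues can be sketched in a few lines of POSIX shell. This is only an illustration under stated assumptions, not the method developed in the following sections: the function names, the state file, the stand-in computation `fake_compute` and the convention that exit status 2 means "checkpoint saved, please continue" are all hypothetical. Only `qsub`, the standard PBS submission command, is real, and it can be overridden through the `SUBMIT` variable for a dry run. Timing is left out; here the stand-in simply stops after a fixed number of steps.

```shell
#!/bin/sh
# Sketch of a checkpoint-and-resubmit cycle. All names are hypothetical.

# Assumed convention: the program exits with this status after saving a
# checkpoint when it needs another run, and with 0 when it has finished.
CONTINUE_STATUS=2

# Resubmission command: qsub on a PBS system, overridable for dry runs.
SUBMIT=${SUBMIT:-qsub}

# Parameters of the stand-in computation below.
STATE_FILE=${STATE_FILE:-state.dat}
TOTAL=${TOTAL:-5}   # total number of steps to compute
CHUNK=${CHUNK:-3}   # steps that fit into one job

# Stand-in for the real program: resumes from STATE_FILE if it exists,
# advances by at most CHUNK steps, then saves its progress.
fake_compute() {
    step=0
    [ -f "$STATE_FILE" ] && step=$(cat "$STATE_FILE")
    n=0
    while [ "$step" -lt "$TOTAL" ] && [ "$n" -lt "$CHUNK" ]; do
        step=$((step + 1))
        n=$((n + 1))
    done
    echo "$step" > "$STATE_FILE"
    # Tell the caller whether more work remains.
    [ "$step" -lt "$TOTAL" ] && return "$CONTINUE_STATUS"
    return 0
}

# run_and_decide PROGRAM SCRIPT
# Runs PROGRAM, then either resubmits SCRIPT or reports completion.
run_and_decide() {
    program=$1
    script=$2
    status=0
    "$program" || status=$?
    if [ "$status" -eq "$CONTINUE_STATUS" ]; then
        echo "checkpoint saved, resubmitting $script"
        $SUBMIT "$script"
    elif [ "$status" -eq 0 ]; then
        echo "computation completed"
    else
        echo "program failed with status $status" >&2
        return "$status"
    fi
}
```

A job script would end with a call such as `run_and_decide fake_compute job.sh`, where `job.sh` stands for the path of the job script itself. Reserving one particular non-zero status for "continue" keeps a genuine failure from being mistaken for a request to resubmit.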
Zdzislaw Meglicki
2004-04-29