
Restoring and Saving the State of a Job

In this section we shall discuss how to save and then restore the state of a computation between successive invocations of a program under LoadLeveler. The basic idea is that information can be transferred between successive invocations of a program in only three ways:

through a file, or
through an environment variable, or
through command-line switches.
Transferring data through a file is perhaps the most common practice. Using files you can transfer very large amounts of data: e.g., the whole state of a 3D flow, or the whole state of a protein, or the whole state of a car in a crash simulation. Basically, files can be used to transfer any information from one invocation of a program to the next, including even small items of information, such as whether the program should restart a computation from a previously reached state or start a new one.
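The file-based approach can be sketched in C as follows. This is only a minimal illustration, not a prescribed method: the State structure, the N_CELLS constant and the function names save_state and restore_state are all hypothetical, and the raw binary dump assumes that the file is read back on the same kind of machine, by a program built with the same compiler, as the one that wrote it.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical state of a computation: a step counter and a small field. */
#define N_CELLS 4

typedef struct {
    long step;
    double field[N_CELLS];
} State;

/* Write the whole state to `path`; returns 0 on success, -1 on failure. */
int save_state(const char *path, const State *s)
{
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "wb");
    if (f == NULL)
        return -1;
    size_t written = fwrite(s, sizeof *s, 1, f);
    /* fclose flushes the buffers, so its result matters too. */
    if (fclose(f) != 0 || written != 1)
        return -1;
    return 0;
}

/* Read the state back; returns 0 on success, -1 if no usable checkpoint
   exists, in which case the caller should start a new computation. */
int restore_state(const char *path, State *s)
{
    assert(path != NULL && s != NULL);
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    size_t got = fread(s, sizeof *s, 1, f);
    fclose(f);
    return got == 1 ? 0 : -1;
}
```

A program would typically call restore_state at start-up and fall back to a fresh initialisation if it returns -1, and call save_state shortly before its time allocation runs out.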

Instead of writing to files, the program can, in principle, also write to the user's environment. On the next invocation the program can check for the existence and state of certain predefined environment variables, and obtain the required information that way. This method is good for transferring small amounts of information, e.g., the name of a checkpoint file, or the request to initialise a run, but not for very large data sets.

Of course, using environment variables will not work if the variables themselves are not passed on to the job. There is a special LoadLeveler directive for this:

# @ environment = COPY_ALL
which instructs LoadLeveler to copy all environmental variables from the current shell and transfer them to the shell within which the job will be executed.

But there is one problem with writing to the user's environment: few languages let you do it. It is straightforward from C (or C++), where POSIX systems provide setenv and putenv. Fortran 90 compilers commonly provide a procedure for reading the environment, getenv, as an extension (a standard intrinsic, GET_ENVIRONMENT_VARIABLE, arrived only with Fortran 2003), but nothing for writing to it. Common Lisp, in turn, specifies only that such procedures should be available in the implementation-dependent system (nickname: sys) package, but does not specify exactly what should be in that package. Most Lisps that I know of have sys::getenv, but, again, not all of them have sys::putenv or sys::setenv.
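For the C case, writing to the environment might look like the sketch below. It assumes a POSIX system, since setenv is a POSIX function and not part of ISO C; the variable name CKPT_FILE and the function name export_checkpoint_name are hypothetical. Note that a change made this way is seen by the process itself and by any children it starts afterwards.

```c
#define _POSIX_C_SOURCE 200112L  /* exposes setenv() on POSIX systems */
#include <assert.h>
#include <stdlib.h>

/* Record the checkpoint name in the environment of this process.
   CKPT_FILE is a hypothetical variable name. Returns 0 on success. */
int export_checkpoint_name(const char *name)
{
    assert(name != NULL);
    return setenv("CKPT_FILE", name, 1);  /* 1 = overwrite an old value */
}
```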

So we shall have to use some other mechanism to convey the name of the checkpoint file and whether the job should be continued. For example, the application can write this information at the end of a log file. After the application exits, the LoadLeveler script responsible for running it can inspect the log, and if it finds the instruction that the job should be continued, it can transfer that information to environment variables and resubmit itself. When LoadLeveler next activates that script, our application will begin by checking the environment for certain variables and their content. From there it will learn whether it should continue or reinitialise the computation and, if it should continue, where to look for the state reached by the previous run.
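The application's side of this scheme could be sketched in C roughly as follows. Everything named here is an assumption made for illustration: the variables RESTART and CKPT_FILE, the value "yes", the "CONTINUE <file>" line format, and the function names. The job script's part (inspecting the log and resubmitting) is not shown.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Called just before exit: leave the request at the end of the log so that
   the job script can find it. The "CONTINUE <file>" format is an assumption. */
void log_continuation(FILE *log, const char *checkpoint)
{
    assert(log != NULL && checkpoint != NULL);
    fprintf(log, "CONTINUE %s\n", checkpoint);
}

/* Pure decision: both values come from getenv() and may be NULL. */
int should_continue(const char *restart, const char *file)
{
    return restart != NULL && strcmp(restart, "yes") == 0
        && file != NULL && file[0] != '\0';
}

/* Called at start-up: returns the name of the checkpoint file to resume
   from, or NULL if the computation should be initialised afresh.
   RESTART and CKPT_FILE are hypothetical variable names. */
const char *checkpoint_to_resume(void)
{
    const char *file = getenv("CKPT_FILE");
    return should_continue(getenv("RESTART"), file) ? file : NULL;
}
```

Keeping the decision in should_continue, separate from the getenv calls, makes the rule easy to check without touching the real environment.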

Instead of using the environment, we could pass the instruction to restart the computation and the name of the checkpoint file as command-line arguments. This is a neat way of doing things, but it is somewhat harder to program than reading the environment. You can use this mechanism portably with C and C++, but not with Fortran or Lisp programs. Almost all Fortrans that I have worked with support reading command-line arguments, but they all do it differently, and it is not a part of the Fortran 90 standard.
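In C, the command-line variant might be sketched like this. The switch name -restart and the function name restart_file_from_args are hypothetical; a real program would probably accept further options and report malformed ones.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Scan the arguments for "-restart <file>" (a hypothetical switch).
   Returns the checkpoint name, or NULL if the run should start afresh. */
const char *restart_file_from_args(int argc, char *argv[])
{
    assert(argc >= 0 && (argc == 0 || argv != NULL));
    /* Stop one short of the end: the switch needs a value after it. */
    for (int i = 1; i + 1 < argc; i++) {
        if (strcmp(argv[i], "-restart") == 0)
            return argv[i + 1];
    }
    return NULL;
}
```

The program's main function would pass its own argc and argv to this function and then either restore the named checkpoint or initialise a new computation.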

Zdzislaw Meglicki