Next: The Complete Application in Up: Checkpointing and Resubmission Previous: Restoring and Saving in

Restoring, Timing, and Saving a Job: the Complete Application

In this section we shall combine job timing with job restoring and saving to produce a complete application. In the next section that application will be combined with a LoadLeveler script to produce an automatically resubmitting job. As in the previous two sections, we shall present example codes in C and in Fortran 90.

The program is a slight modification of our restore and save example. There are no really new elements here that would require a broader explanation.

The additional logic that is laid out on top of the restore and save example is as follows.

We begin by checking for a new environment variable, RSAVE_TIME_LIMIT. If that variable does not exist, we assume that the time allowed for this job is unlimited, and things work more or less as before. If the variable exists, we attempt to read its value, assuming that it is a number. If it is not a number, we print an error message and exit. If it is a number, it is assigned to the variable time_limit and taken to be the number of wall-clock seconds allocated to this job.
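The check described above might look roughly like the following in C. This is only a sketch, not the program's actual source: the helper names parse_seconds and read_time_limit are hypothetical, as are the conventions of storing -1 to mean "no limit" and of rejecting negative values.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: convert text to a non-negative number of seconds.
   Returns 0 on success and stores the value in *out.
   Returns -1 if the text is empty, not a whole number, out of range,
   or negative. */
int parse_seconds(const char *text, long *out)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0' || errno == ERANGE || value < 0)
        return -1;
    *out = value;
    return 0;
}

/* Hypothetical helper: look up the environment variable `name`.
   If it is not set, store -1 in *limit (assumed to mean "untimed")
   and return 0. Otherwise return whatever parse_seconds returns. */
int read_time_limit(const char *name, long *limit)
{
    const char *text = getenv(name);

    if (text == NULL) {
        *limit = -1;
        return 0;
    }
    return parse_seconds(text, limit);
}
```

A caller would invoke read_time_limit("RSAVE_TIME_LIMIT", &time_limit) and print an error message and exit if the result is -1.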

On our IU system those queues that are timed at all are timed in CPU seconds. But what really matters to other users and to yourself is how long you have to wait until your job gets out of the way. For this reason I use wall-clock timers, i.e., function time in C and subroutine system_clock in Fortran 90. However, you can easily modify the programs to look up the CPU time instead.
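To illustrate the difference between the two kinds of timers in C, here is a small sketch using only the standard library. The function names wall_seconds_since and cpu_seconds_used are hypothetical; time, difftime, clock, and CLOCKS_PER_SEC are standard.

```c
#include <time.h>

/* Wall-clock seconds elapsed since `start` (a value returned earlier
   by time(NULL)). The resolution of time() is typically one second. */
double wall_seconds_since(time_t start)
{
    return difftime(time(NULL), start);
}

/* Processor time consumed by the program so far, in seconds.
   This is the quantity to use if the CPU time is wanted instead. */
double cpu_seconds_used(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}
```

A program that sleeps, as our toy example does, accumulates wall-clock time while using almost no CPU time, so the two functions can report very different numbers.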

Once the information about the time limit is obtained, we proceed exactly as before until we get to the part of the program that does the computation. Instead of just incrementing number n and sleeping for 5 seconds, we enter a loop.

If no timing has been requested, the loop keeps incrementing n and sleeping until n becomes greater than LAST_N. The latter is an arbitrary constant, which in our toy example represents something like a convergence criterion. Once convergence has been reached, the finished flag is set to TRUE and the loop exits.

Things are more interesting if timing of the job has been requested (by setting the environment variable RSAVE_TIME_LIMIT to some number of seconds). In that case we measure the time taken by one iteration of the loop, and we check how much time is still left after the iteration has finished. If there is still enough time to perform another iteration, we continue. If not, the loop exits.

Because saving the data, cleaning up, and executing the LoadLeveler script may take additional time, we have to include a SAFETY_MARGIN when calculating the time that still remains. In this case we set SAFETY_MARGIN to 10 seconds, but if you have to save a very large data set, you should probably reserve a couple of minutes.
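The loop logic and the safety margin can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the program's actual source: the names enough_time_left and iterate are hypothetical, a negative time_limit is assumed to mean that no timing was requested, the 5-second sleep of the toy example is left out, and elapsed time is measured from the start of the loop rather than from the start of the job.

```c
#include <time.h>

#define SAFETY_MARGIN 10.0   /* seconds reserved for saving and cleanup */

/* Returns 1 if another iteration, assumed to take as long as the last
   one, still fits before the limit minus the safety margin.
   A negative time_limit is assumed to mean "no limit". */
int enough_time_left(long time_limit, double elapsed, double last_iteration)
{
    if (time_limit < 0)
        return 1;
    return (double)time_limit - elapsed - SAFETY_MARGIN >= last_iteration;
}

/* Advance n until it exceeds last_n or time runs out.
   Sets *finished to 1 only if the "convergence criterion" was met.
   Returns the final value of n. */
int iterate(int n, int last_n, long time_limit, int *finished)
{
    time_t start = time(NULL);

    *finished = 0;
    for (;;) {
        time_t before = time(NULL);
        time_t now;

        n++;                        /* the toy "computation" */
        if (n > last_n) {
            *finished = 1;          /* converged: the whole job is done */
            break;
        }
        now = time(NULL);
        if (!enough_time_left(time_limit, difftime(now, start),
                              difftime(now, before)))
            break;                  /* stop early, leaving time to save */
    }
    return n;
}
```

With a limit of 100 seconds, 50 seconds elapsed, and iterations of 5 seconds, 40 seconds remain after the margin and the loop continues. With 88 seconds elapsed only 2 seconds remain, so the loop exits.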

Flagging the resubmission is accomplished as follows.

Before exiting, we check whether the whole job is finished, which it will be once the convergence criterion is satisfied. If the job is finished, we write FINISHED to standard output. Otherwise we write CONTINUE.

If standard output has been logged to a file, then after the program exits the LoadLeveler script can inspect the log and resubmit itself if it finds the word CONTINUE there. How that works will be shown in the next section.
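The flagging step might be written in C along the following lines. The function names status_word and report_status are hypothetical; only the two words FINISHED and CONTINUE come from the program described here.

```c
#include <stdio.h>

/* Map the finished flag to the word written on standard output. */
const char *status_word(int finished)
{
    return finished ? "FINISHED" : "CONTINUE";
}

/* Write the status word on a line of its own and flush the stream,
   so that the word is in the log before the program exits. */
void report_status(int finished)
{
    puts(status_word(finished));
    fflush(stdout);
}
```

Writing the word on a line of its own keeps it easy to find for whatever inspects the output afterwards.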

Zdzislaw Meglicki