In this section we shall combine job timing with job restoring and saving to produce a complete application. In the next section that application will be combined with a LoadLeveler script to produce an automatically resubmitting job. As in the previous two sections, we shall present example codes in C and in Fortran 90.
The program is a slight modification of our restore and save example. It introduces no new elements that would require a broader explanation.
The additional logic that is laid out on top of the restore and save example is as follows.
We begin by checking for a new environment variable, RSAVE_TIME_LIMIT. If that variable does not exist, we assume
that the time allowed for this job is unlimited, and things work more or less as before. If the variable exists, we attempt to
read its value, assuming that it is going to be a number. If it is not a number, we print an error message and exit. If it is a
number, it is assigned to the variable time_limit and taken to represent the number of wall-clock seconds
allocated to this job.
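The check described above can be sketched in C roughly as follows. This is only an illustration, not the listing of the example program: the helper name parse_time_limit and the flag limited are hypothetical, and we assume that the caller passes in the result of getenv("RSAVE_TIME_LIMIT") and exits if the helper reports an error.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper. "value" is what getenv("RSAVE_TIME_LIMIT") returned.
   Returns 0 on success and -1 if the variable is set but is not a number.
   On success *limited is 0 (no limit) or 1 (limit stored in *time_limit). */
int parse_time_limit(const char *value, int *limited, long *time_limit)
{
    char *end;
    long seconds;

    if (value == NULL) {            /* variable not set: unlimited time */
        *limited = 0;
        *time_limit = 0;
        return 0;
    }

    errno = 0;
    seconds = strtol(value, &end, 10);
    /* Reject empty strings, trailing junk, overflow and negative values. */
    if (end == value || *end != '\0' || errno == ERANGE || seconds < 0) {
        fprintf(stderr, "RSAVE_TIME_LIMIT: \"%s\" is not a number\n", value);
        return -1;                  /* the caller is expected to exit */
    }

    *limited = 1;
    *time_limit = seconds;          /* wall-clock seconds for this job */
    return 0;
}
```

A caller would typically write something like: if (parse_time_limit(getenv("RSAVE_TIME_LIMIT"), &limited, &time_limit) != 0) exit(1);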
On our IU system those queues that are timed at all are timed in terms of CPU seconds. But
what really matters to other users and to yourself is how long you have to wait until your job gets out of the way. For this reason I use
wall-clock timers, i.e., function time in C and subroutine system_clock in Fortran 90. However, you can easily modify
the programs to look up the CPU time instead.
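To illustrate the difference between the two kinds of timers in C, here is a minimal sketch. The helper names wall_seconds_since and cpu_seconds_since are hypothetical. The first relies on the standard function time, the second on the standard function clock, which reports the processor time used by the program.

```c
#include <time.h>

/* Hypothetical helper: wall-clock seconds elapsed since "start",
   where "start" was obtained earlier with time(NULL). */
double wall_seconds_since(time_t start)
{
    return difftime(time(NULL), start);
}

/* Hypothetical helper: CPU seconds consumed since "start",
   where "start" was obtained earlier with clock(). */
double cpu_seconds_since(clock_t start)
{
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

A job that spends most of its time sleeping or waiting for I/O will show a large wall-clock time and a very small CPU time, which is why the choice of timer matters.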
Once the information about the time limit has been obtained, we proceed exactly as before until we get to the part of the program that
does the computation. Instead of just incrementing the number n and sleeping for 5 seconds, we enter a loop.
If no timing has been requested, the loop keeps incrementing n and sleeping until n becomes greater than LAST_N. The latter
is an arbitrary constant, which in our toy example represents something like a convergence criterion. Once convergence
has been reached, the finished flag is set to TRUE and the loop exits.
Things are more interesting if timing of the job has been requested (by setting the environment variable RSAVE_TIME_LIMIT
to some number of seconds). In that case we measure the time taken by one iteration of the loop, and we check how much time
is left after the iteration has finished. If there is still enough time to perform another iteration, we continue. If not, the
loop exits.
Because saving the data, cleaning up, and executing the LoadLeveler script may take additional time, we have to include a
SAFETY_MARGIN when calculating the time that remains. In this case we set SAFETY_MARGIN to 10 seconds, but if you have to
save a very large data set, you should probably reserve a couple of minutes.
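The loop logic described above might look roughly like this in C. This is a sketch under several assumptions, not the actual example program: the helper names time_for_another and run_loop are hypothetical, LAST_N is set to 10 arbitrarily, the elapsed time is measured from a job_start value supplied by the caller, and the parameter work_seconds stands in for the 5-second sleep so that the function can also be tried out quickly.

```c
#include <time.h>
#include <unistd.h>             /* sleep() is POSIX, not standard C */

#define TRUE 1
#define FALSE 0
#define LAST_N 10               /* assumed value of the toy convergence limit */
#define SAFETY_MARGIN 10        /* seconds reserved for saving and cleanup */

/* Hypothetical helper: is there room for one more iteration, given the
   time used so far and the time the last iteration took? */
int time_for_another(long time_limit, long elapsed, long last_iteration)
{
    return elapsed + last_iteration + SAFETY_MARGIN <= time_limit;
}

/* Hypothetical driver. Returns TRUE if n passed LAST_N (job finished),
   FALSE if the loop stopped early because time was running out. */
int run_loop(int *n, int limited, long time_limit, time_t job_start,
             unsigned work_seconds)
{
    int finished = FALSE;

    while (!finished) {
        time_t before = time(NULL);
        time_t after;

        sleep(work_seconds);    /* stands in for the real computation */
        (*n)++;
        after = time(NULL);

        if (*n > LAST_N) {
            finished = TRUE;    /* "convergence" reached */
        } else if (limited &&
                   !time_for_another(time_limit,
                                     (long)difftime(after, job_start),
                                     (long)difftime(after, before))) {
            break;              /* not enough time left: stop and save */
        }
    }
    return finished;
}
```

With limited set to 0 the time check is skipped altogether, which reproduces the untimed behaviour. Estimating the next iteration from the last one is a simplification that works only if iterations take roughly the same time.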
Flagging the resubmission is accomplished as follows.
Before exiting, we check whether the whole job is finished, which it will be once the convergence criterion is satisfied. If the job is
finished, we write FINISHED on standard output. Otherwise we write CONTINUE.
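In C this final step amounts to very little. The following sketch assumes a finished flag as described above. The function names status_word and report_status are hypothetical.

```c
#include <stdio.h>

/* Hypothetical helper: map the finished flag to the word that
   is written on standard output. */
const char *status_word(int finished)
{
    return finished ? "FINISHED" : "CONTINUE";
}

/* Write the word on a line of its own on standard output. */
void report_status(int finished)
{
    puts(status_word(finished));
}
```

Writing the word on a line of its own makes it easy to search for in a log file.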
If the standard output has been logged to a file, then after the program exits the LoadLeveler script can inspect the log and resubmit itself if it finds the word CONTINUE there. How that works will be shown in the next section.