In this section we combine job timing with job restoring and saving to produce a complete application. In the next section we will combine it with a PBS script to produce an automatically resubmitting job.
The program is a slight modification of our restore and save example. It contains no new elements that would require a broader explanation.
The additional logic laid on top of the restore and save example is as follows.
We begin by checking for a new environment variable,
RSAVE_TIME_LIMIT. If that variable does not exist, we assume that
the time allowed for this job is unlimited, and things work more or less as
before. If the variable exists, we attempt to read its value,
assuming that it is a number. If it is not a number, we
print an error message and exit. If it is a number, it is
assigned to the variable time_limit and taken to be the number of
wall-clock seconds allocated to this job.
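The check described above can be sketched as a small stand-alone function. This is only an illustration: the name parse_time_limit is hypothetical and does not appear in the program, and the sketch uses strtol, which is stricter than a plain sscanf conversion because it also rejects trailing characters such as the s in 30s.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper, not part of rts.c: convert a string such as the
   value of RSAVE_TIME_LIMIT into a number of seconds.
   Returns 0 on success and stores the value in *seconds.
   Returns -1 if the string is not a non-negative decimal integer. */
int parse_time_limit (const char *s, long *seconds)
{
  char *end;
  long value;

  if (s == NULL || *s == '\0') return -1;      /* nothing to parse */
  errno = 0;
  value = strtol (s, &end, 10);
  if (errno == ERANGE) return -1;              /* does not fit in a long */
  if (end == s || *end != '\0') return -1;     /* no digits, or trailing junk */
  if (value < 0) return -1;                    /* a negative limit makes no sense */
  *seconds = value;
  return 0;
}
```

Called as parse_time_limit (getenv ("RSAVE_TIME_LIMIT"), &limit), it returns -1 both when the variable is unset and when its value is malformed, so a caller that wants to treat "unset" as "unlimited" has to test the result of getenv first.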
As I have mentioned before, what really matters to other users and to
the system administrators is how long they have to wait, in terms of
wall-clock time, until your job gets out of the way. For this reason
I use the wall-clock timer, i.e., the function time. However, you
can easily modify the program to look up the CPU time instead.
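A minimal sketch of the CPU-time alternative, assuming the standard C function clock is an acceptable source of processor time on your system. The helper name cpu_seconds is hypothetical and is not used anywhere in the program.

```c
#include <time.h>

/* Hypothetical helper, not part of rts.c: processor time consumed by
   this process so far, in seconds, or -1.0 if it is not available. */
double cpu_seconds (void)
{
  clock_t c = clock ();

  if (c == (clock_t) -1) return -1.0;   /* clock() could not report a value */
  return (double) c / CLOCKS_PER_SEC;   /* convert clock ticks to seconds */
}
```

Keep in mind that a process that spends its time in sleep, as our toy example does, consumes almost no CPU time, so a limit expressed in CPU seconds would behave very differently from the wall-clock limit used here.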
Once the information about the time limit is obtained we proceed
exactly as before, until we get to the part of the program that does
the computation. Instead of just incrementing number n and
sleeping for 5 seconds, we enter a loop.
If no timing has been requested, the loop keeps incrementing n
and sleeping until n becomes greater than LAST_N. The
latter is an arbitrary constant, which in our toy example stands for
something like a convergence criterion. Once "convergence" has
been reached, the finished flag is set to TRUE and the
loop exits.
Things are more interesting if timing of the job has been requested
(by setting the environment variable RSAVE_TIME_LIMIT to some
number of seconds). In that case we measure the time taken by one
iteration of the loop, and we check how much time is left
after the iteration has finished. If there is enough time to
perform another iteration, we continue; if not, the loop exits.
Because saving the data, cleaning up, and executing the PBS script may
take additional time, we have to include a SAFETY_MARGIN when
calculating the time that remains. In this case we set
SAFETY_MARGIN to 10 seconds, but if you have to save a very
large data set, you should probably reserve a couple of minutes.
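The decision described in the last two paragraphs can be written as a single test. The following is a sketch only: the function name and its parameters are hypothetical, and it assumes that all times are whole wall-clock seconds.

```c
/* Hypothetical helper, not part of rts.c.
   time_limit    - seconds allocated to the whole run
   elapsed       - seconds used since the run started
   loop_time     - seconds taken by the iteration just completed
   safety_margin - seconds reserved for saving data and cleaning up
   Returns 1 if another iteration still fits, 0 otherwise. */
int time_for_another_iteration (long time_limit, long elapsed,
                                long loop_time, long safety_margin)
{
  long time_left = time_limit - elapsed;
  long quit_time = time_left - loop_time - safety_margin;

  return quit_time > 0;   /* strictly positive: zero means stop */
}
```

With a limit of 30 seconds, 5-second iterations, and a 10-second margin, the function returns 1 after 5 and 10 elapsed seconds and 0 after 15, because at that point the remaining 15 seconds are exactly used up by one more iteration plus the margin.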
Flagging the resubmission is accomplished as follows.
Before exiting, we check whether the whole job is finished, which it will
be once our "convergence" criterion is satisfied. If the job is finished,
we write FINISHED on standard output.
Otherwise we write CONTINUE.
The flag is written only if checkpointing has been requested through
the environment variable RSAVE_CHECKFILE.
If the standard output has been logged to a file, then after the program
exits the PBS script can inspect the log and resubmit itself if it
finds the word CONTINUE in the log.
Here is the program. The parameters LAST_N and
SAFETY_MARGIN have been implemented as C preprocessor (cpp) constants. I
could also read them from the environment, the command line, or an
input file, but that would clutter the example.
The program always executes the statements of the do ... while
loop at least once, because the exit condition is tested at the end of
the loop.
Observe that the variable quit_time is initialised to
1l. That way, if timing is not requested, it always remains
positive, and the while test ends the loop only when the job is
finished. Furthermore, the variable timing is initialised to
TRUE and becomes FALSE only if there is no
environment variable RSAVE_TIME_LIMIT. The job is assumed to
be unfinished on entry (the variable finished is initialised to
FALSE) and becomes finished only when n becomes greater
than LAST_N. This means that even after n has become greater
than LAST_N, you can still submit the job, and it will always
increment n by 1 before exiting.
/*
 * rts: Restart Time and Save
 *
 * $Id: rts.c,v 1.1 2003/09/19 20:55:37 gustav Exp $
 *
 * $Log: rts.c,v $
 * Revision 1.1 2003/09/19 20:55:37 gustav
 * Initial revision
 *
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <time.h>

#ifndef TRUE
# define TRUE 1
#endif
#ifndef FALSE
# define FALSE 0
#endif
#ifndef LAST_N
# define LAST_N 30
#endif
#ifndef SAFETY_MARGIN
# define SAFETY_MARGIN 10
#endif

int main(void)
{
  char *restart_name, *restart, old_restart_name[BUFSIZ];
  FILE *restart_file;
  int n, finished = FALSE, timing = TRUE;
  time_t t0, t1 = 0, t2;
  /* Intervals are kept as long so that they match the %ld conversions */
  long loop_time, time_left, time_limit = 0, quit_time = 1l;
  char *time_limit_string;

  /* Check the clock at the beginning of the run */
  t0 = time(NULL);

  /* Check how much time we have for this job */
  if (! (time_limit_string = getenv ("RSAVE_TIME_LIMIT"))) {
    printf ("Unlimited time for this job.\n");
    timing = FALSE;
  }
  else {
    if (! (0 < sscanf (time_limit_string, "%ld", &time_limit))) {
      fprintf (stderr, "Error: bad format of RSAVE_TIME_LIMIT\n");
      exit (1);
    }
    else {
      printf ("Time for this job limited to %ld seconds.\n", time_limit);
    }
  }

  /* Is this a continued job or a new one? */
  if (! (restart = getenv ("RSAVE_RESTART"))) {
    printf ("Starting a new run.\n");
    n = 0;
  }
  else {
    if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
      fprintf (stderr, "error: no checkpoint file for the restart job\n");
      exit (1);
    }
    else {
      printf ("Restarting the job from %s.\n", restart_name);
      if (! (restart_file = fopen(restart_name, "r"))) {
        perror (restart_name);
        exit (2);
      }
      else {
        if (! (fscanf (restart_file, "%d", &n) > 0)) {
          fprintf (stderr, "%s: input file format error\n", restart_name);
          exit (3);
        }
        else {
          fclose (restart_file);
        }
      }
    }
  }

  printf ("n = %d\n", n);
  printf ("\tcomputing ... \n"); fflush (stdout);

  /* Loop while keeping an eye on the clock */
  do {
    if (timing) t1 = time(NULL);
    sleep (5);
    n++;
    /* Check if the whole simulation has been finished:
       this is our ``convergence'' criterion.
     */
    if (n > LAST_N) finished = TRUE;
    /* Check if we still have enough time for the next loop.
     */
    if (timing) {
      t2 = time(NULL);
      loop_time = (long) (t2 - t1);
      time_left = time_limit - (long) (t2 - t0);
      quit_time = time_left - loop_time - SAFETY_MARGIN;
      printf ("\t\tn = %d, time left = %ld seconds\n", n, time_left);
      if ((quit_time <= 0) && (! finished))
        printf ("\t\tRun out of time, exiting ... \n");
    }
  } while ((quit_time > 0) && (! finished));

  printf ("\tdone.\n");
  printf ("n = %d\n", n);

  /* Save the state and flag whether the job has to be resubmitted */
  if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
    printf ("checkpointing not requested, exiting...\n");
    exit (0);
  }
  else {
    if (restart) {
      /* Make sure that the name plus ".old" fits in the buffer */
      if (strlen (restart_name) + sizeof (".old") > sizeof (old_restart_name)) {
        fprintf (stderr, "%s: file name too long\n", restart_name);
        exit (4);
      }
      strcpy (old_restart_name, restart_name);
      strcat (old_restart_name, ".old");
      printf ("renaming old restart file to %s\n", old_restart_name);
      if (0 > rename (restart_name, old_restart_name)) {
        perror (old_restart_name);
        exit (4);
      }
    }
    printf ("saving data on %s\n", restart_name);
    if (! (restart_file = fopen (restart_name, "w"))) {
      perror (restart_name);
      exit (5);
    }
    else {
      fprintf (restart_file, "%d\n", n);
      fclose (restart_file);
    }
    if (! finished)
      printf ("CONTINUE\n");
    else
      printf ("FINISHED\n");
  }
  exit (0);
}
Here is how this job is run. First I submit it with the environment
variable RSAVE_RESTART unset, which initialises the job. Then I
set RSAVE_RESTART to yes and resubmit the job, which
restarts from where it left off.
The job is allowed to run no longer than 30 seconds at a time. Given
the safety margin of 10 seconds and a single iteration time of 5
seconds, this should let our program do 4 iterations. But the
while clause tests for quit_time > 0, not for
quit_time >= 0. After the third iteration 15 seconds are left, so
quit_time is 15 - 5 - 10 = 0 and the loop exits. In effect we end up
with 3 iterations instead of 4.
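The iteration count can be checked without waiting for the clock. The sketch below replaces sleep and time with a simulated clock in which every iteration takes exactly loop_time seconds. The function count_iterations and its strict parameter are hypothetical, the convergence test is left out, and loop_time is assumed to be positive (otherwise the loop would never end).

```c
/* Hypothetical helper, not part of rts.c: count the iterations that one
   run completes under a simulated clock.
   If strict is non-zero the loop continues while quit_time > 0,
   as in the program; otherwise it continues while quit_time >= 0. */
int count_iterations (long time_limit, long loop_time,
                      long safety_margin, int strict)
{
  long elapsed = 0, quit_time;
  int iterations = 0;

  do {
    elapsed += loop_time;   /* stands in for sleep() and time() */
    iterations++;
    quit_time = (time_limit - elapsed) - loop_time - safety_margin;
  } while (strict ? quit_time > 0 : quit_time >= 0);

  return iterations;
}
```

For a limit of 30 seconds, 5-second iterations, and a 10-second margin, the strict test gives 3 iterations and the non-strict test gives 4.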
While the computational task remains unfinished, program rts
writes CONTINUE on standard output before it exits. But the
last run, when n becomes 31, is flagged with the word
FINISHED.
[gustav@bh1 rts]$ make
co RCS/Makefile,v Makefile
RCS/Makefile,v --> Makefile
revision 1.1
done
co RCS/rts.c,v rts.c
RCS/rts.c,v --> rts.c
revision 1.1
done
cc -c rts.c
cc -o rts rts.o
[gustav@bh1 rts]$ env | grep RSAVE
RSAVE_CHECKFILE=rts.dat
RSAVE_TIME_LIMIT=30
[gustav@bh1 rts]$ ./rts
Time for this job limited to 30 seconds.
Starting a new run.
n = 0
computing ...
n = 1, time left = 25 seconds
n = 2, time left = 20 seconds
n = 3, time left = 15 seconds
Run out of time, exiting ...
done.
n = 3
saving data on rts.dat
CONTINUE
[gustav@bh1 rts]$ export RSAVE_RESTART=yes
[gustav@bh1 rts]$ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 3
computing ...
n = 4, time left = 25 seconds
n = 5, time left = 20 seconds
n = 6, time left = 15 seconds
Run out of time, exiting ...
done.
n = 6
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
[gustav@bh1 rts]$ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 6
computing ...
n = 7, time left = 25 seconds
n = 8, time left = 20 seconds
n = 9, time left = 15 seconds
Run out of time, exiting ...
done.
n = 9
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
[gustav@bh1 rts]$
...
[gustav@bh1 rts]$ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 24
computing ...
n = 25, time left = 25 seconds
n = 26, time left = 20 seconds
n = 27, time left = 15 seconds
Run out of time, exiting ...
done.
n = 27
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
[gustav@bh1 rts]$ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 27
computing ...
n = 28, time left = 25 seconds
n = 29, time left = 20 seconds
n = 30, time left = 15 seconds
Run out of time, exiting ...
done.
n = 30
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
[gustav@bh1 rts]$ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 30
computing ...
n = 31, time left = 25 seconds
done.
n = 31
renaming old restart file to rts.dat.old
saving data on rts.dat
FINISHED
[gustav@bh1 rts]$