
Restoring, Timing and Saving a Job: the Complete Application

In this section we shall combine job timing with job restoring and saving to produce a complete application. In the next section this application will be combined with a PBS script to produce an automatically resubmitting job.

The program is a slight modification of our restore and save example. It introduces no new elements that would require a broader explanation.

The additional logic that is laid out on top of the restore and save example is as follows.

We begin by checking for a new environment variable, RSAVE_TIME_LIMIT. If that variable does not exist, we assume that the time allowed for this job is unlimited and things work more or less as before. If the variable exists, we attempt to read its value, assuming that it is going to be a number. If it is not a number, we print an error message and exit. If it is a number, it is assigned to the variable time_limit and assumed to represent the number of wall-clock seconds allocated to this job.
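
The check described above can also be written as a small stand-alone function. What follows is only a sketch: parse_time_limit is a hypothetical helper that does not appear in rts.c, and it uses strtol rather than the sscanf call of the listing below. It is therefore somewhat stricter, since a value such as 30abc is rejected instead of being read as 30.

```c
#include <errno.h>
#include <stdlib.h>

/* parse_time_limit: hypothetical helper, not part of rts.c.
 * Converts a string of decimal seconds to a long.
 * Returns 0 and stores the value in *seconds on success.
 * Returns -1 if the string is NULL, empty, not a pure number,
 * negative, or out of range for a long. */
int parse_time_limit(const char *text, long *seconds)
{
    char *end;
    long value;

    if (text == NULL || *text == '\0')
        return -1;

    errno = 0;
    value = strtol(text, &end, 10);

    /* end == text: no digits were read; *end != '\0': trailing junk */
    if (errno == ERANGE || end == text || *end != '\0' || value < 0)
        return -1;

    *seconds = value;
    return 0;
}
```

A caller would pass it the result of getenv ("RSAVE_TIME_LIMIT") after checking that the variable exists, and print the error message and exit if the function returns -1.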

As I have mentioned before, what really matters to other users and to the system administrators is how long they have to wait, in terms of wall-clock time, until your job gets out of the way. For this reason I use the wall-clock timer, i.e., the function time. However, you can easily modify the program to look up the CPU time instead.
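
To make the distinction concrete, here is a minimal sketch of the two measurements side by side. The function names cpu_seconds_used and wall_seconds_since are hypothetical and do not appear in rts.c. The sketch assumes that the standard C function clock is an acceptable source of CPU time; other interfaces exist and may suit a real application better.

```c
#include <time.h>

/* cpu_seconds_used: hypothetical helper, not part of rts.c.
 * Returns the processor time consumed by this process so far,
 * in seconds, or -1.0 if the information is not available. */
double cpu_seconds_used(void)
{
    clock_t ticks = clock();

    if (ticks == (clock_t) -1)
        return -1.0;
    return (double) ticks / CLOCKS_PER_SEC;
}

/* wall_seconds_since: hypothetical helper, not part of rts.c.
 * Returns the wall-clock seconds elapsed since start, where
 * start is a value obtained earlier from time(NULL). */
double wall_seconds_since(time_t start)
{
    return difftime(time(NULL), start);
}
```

Note that a process that spends its time in sleep, as our toy example does, accumulates wall-clock time but almost no CPU time, so the two readings can differ widely.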

Once the information about the time limit is obtained we proceed exactly as before, until we get to the part of the program that does the computation. Instead of just incrementing number n and sleeping for 5 seconds, we enter a loop.

If no timing has been requested, the loop keeps incrementing n and sleeping until n becomes greater than LAST_N. The latter is an arbitrary constant, which in our toy example represents something like a convergence criterion. Once the "convergence" has been reached, the finished flag is set to TRUE and the loop exits.

Things are more interesting if timing of the job has been requested (by setting the environment variable RSAVE_TIME_LIMIT to some number of seconds). In that case we measure the time taken by one iteration of the loop, and we check how much time is still left after the iteration has finished. If there is still enough time to perform another iteration, we continue; if not, the loop exits.

Because saving the data, cleaning up, and executing the PBS script may take additional time, we have to include a SAFETY_MARGIN when calculating the time that still remains. In this case we set SAFETY_MARGIN to 10 seconds, but if you have to save a very large data set, you should probably reserve a couple of minutes.
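
The bookkeeping described in the last two paragraphs amounts to two subtractions. The sketch below isolates it in a pair of hypothetical functions, quit_time_after and may_continue, which do not appear in rts.c; all times are plain long values in seconds, an assumption made for the sake of simplicity.

```c
/* quit_time_after: hypothetical helper, not part of rts.c.
 * time_limit     seconds allocated to the whole job
 * elapsed        seconds used since the job started
 * loop_time      seconds taken by the iteration just finished
 * safety_margin  seconds reserved for saving and cleaning up
 * Returns the slack that would remain if one more iteration,
 * as long as the last one, were run and the margin kept. */
long quit_time_after(long time_limit, long elapsed,
                     long loop_time, long safety_margin)
{
    long time_left = time_limit - elapsed;

    return time_left - loop_time - safety_margin;
}

/* may_continue: hypothetical helper, not part of rts.c.
 * Returns 1 if another iteration still fits, 0 otherwise.
 * The comparison is strict, as in the while test of rts.c. */
int may_continue(long time_limit, long elapsed,
                 long loop_time, long safety_margin)
{
    return quit_time_after(time_limit, elapsed,
                           loop_time, safety_margin) > 0;
}
```

For example, with a limit of 30 seconds, 10 seconds elapsed, a 5 second iteration and a 10 second margin, the slack is 5 seconds and the loop continues. With 15 seconds elapsed the slack is 0 and the loop exits.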

Flagging the resubmission is accomplished as follows.

Before exiting, we check if the whole job is finished, which it will be once our "convergence" criterion is satisfied. If the job is finished we write FINISHED on standard output. Otherwise we write CONTINUE.

If the standard output has been logged to a file, then after the program exits the PBS script can inspect the log and resubmit itself if it finds the word CONTINUE there.
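
The inspection of the log belongs to the PBS script, which is a shell script, but the test itself is simple enough to sketch in C. The function log_requests_continue below is hypothetical, and so is the decision to match only a line that consists of exactly the word CONTINUE; the script may well search for the word in a different way.

```c
#include <stdio.h>
#include <string.h>

/* log_requests_continue: hypothetical helper, not part of rts.c.
 * Returns 1 if some line of the log is exactly "CONTINUE",
 * 0 if no such line exists, and -1 if the log cannot be opened. */
int log_requests_continue(const char *log_path)
{
    char line[BUFSIZ];
    int found = 0;
    FILE *log = fopen(log_path, "r");

    if (log == NULL)
        return -1;

    while (fgets(line, sizeof line, log) != NULL) {
        /* Cut the line at the first end-of-line character */
        line[strcspn(line, "\r\n")] = '\0';
        if (strcmp(line, "CONTINUE") == 0) {
            found = 1;
            break;
        }
    }
    fclose(log);
    return found;
}
```

Matching the whole line, rather than searching for a substring, avoids a false alarm if the application ever prints a longer message that happens to contain the same word.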

Here is the program. The parameters LAST_N and SAFETY_MARGIN have been implemented as C preprocessor (cpp) constants. I could also read them from the environment, the command line, or an input file, but that would clutter the example.

The program always executes the statements of the do ... while loop at least once, because the exit condition is tested at the end of the loop.

Observe that the variable quit_time is initialised to 1l. That way, if timing is not requested, it always remains positive and the while test terminates the loop only when the job is finished. Furthermore, the variable timing is initialised to TRUE, and becomes FALSE only if there is no environment variable RSAVE_TIME_LIMIT. The job is assumed to be unfinished on entry (the variable finished is initialised to FALSE) and becomes finished only when n becomes greater than LAST_N. This means that once n has become greater than LAST_N, you can still submit the job, and it will always increment n by 1 before exiting.

/*
 * rts: Restart Time and Save
 *
 * %Id: rts.c,v 1.1 2003/09/19 20:55:37 gustav Exp %
 * 
 * %Log: rts.c,v %
 * Revision 1.1  2003/09/19 20:55:37  gustav
 * Initial revision
 *
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <time.h>

#ifndef TRUE
# define TRUE 1
#endif
#ifndef FALSE
# define FALSE 0
#endif

#ifndef LAST_N
# define LAST_N 30
#endif

#ifndef SAFETY_MARGIN
# define SAFETY_MARGIN 10
#endif

int main(void)
{
  char *restart_name, *restart, old_restart_name[BUFSIZ];
  FILE *restart_file;
  int n, finished = FALSE, timing = TRUE;
  time_t t0, t1, t2, loop_time, time_left, time_limit, quit_time = 1l;
  char *time_limit_string;

  /* Check the clock at the beginning of the run */
  t0 = time(NULL);

  /* Check how much time we have for this job */
  if (! (time_limit_string = getenv ("RSAVE_TIME_LIMIT"))) {
    printf ("Unlimited time for this job.\n");
    timing = FALSE;
  }
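  else {
    /* time_t need not be an int, so read into a long and convert */
    long limit_seconds;
    if (! (0 < sscanf (time_limit_string, "%ld", &limit_seconds))) {
      fprintf (stderr, "Error: bad format of RSAVE_TIME_LIMIT\n");
      exit (1);
    }
    else {
      time_limit = (time_t) limit_seconds;
      printf ("Time for this job limited to %ld seconds.\n", limit_seconds);
    }
  }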
    
  /* Is this a continued job or a new one? */

  if (! (restart = getenv ("RSAVE_RESTART"))) {
    printf ("Starting a new run.\n");
    n = 0;
  }
  else {
    if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
      fprintf (stderr, "error: no checkpoint file for the restart job\n");
      exit (1);
    }
    else {
      printf ("Restarting the job from %s.\n", restart_name);
      if (! (restart_file = fopen(restart_name, "r"))) {
        perror (restart_name);
        exit (2);
      }
      else {
        if (! (fscanf (restart_file, "%d", &n) > 0)) {
          fprintf (stderr, "%s: input file format error\n", restart_name);
          exit (3);
        }
        else {
          fclose (restart_file);
        }
      }          
    }
  }

  printf ("n = %d\n", n);
  printf ("\tcomputing ... \n"); fflush (stdout);

  /* Loop while keeping an eye on the clock */

  do {
    if (timing) t1 = time(NULL);

    sleep (5);
    n++;

    /* Check if the whole simulation has been finished: 
       this is our ``convergence'' criterion. 
    */
    if (n > LAST_N) finished = TRUE;

    /* Check if we still have enough time for the next loop.
     */
    if (timing) {
      t2 = time(NULL);
      loop_time = t2 - t1;
      time_left = time_limit - (t2 - t0);
      quit_time = time_left - loop_time - SAFETY_MARGIN;
      /* time_t need not be an int, so cast before printing */
      printf ("\t\tn = %d, time left = %ld seconds\n", n, (long) time_left);
      if ((quit_time <= 0) && (! finished))
        printf ("\t\tRun out of time, exiting ... \n");
    }
  } while ((quit_time > 0) && (! finished));

  printf ("\tdone.\n");
  printf ("n = %d\n", n);

  if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
    printf ("checkpointing not requested, exiting...\n");
    exit (0);
  }
  else {
    if (restart) {
      strcpy (old_restart_name, restart_name);
      strcat (old_restart_name, ".old");
      printf ("renaming old restart file to %s\n", old_restart_name);
      if (0 > rename (restart_name, old_restart_name)) {
        perror (old_restart_name);
        exit (4);
      }
    }
    printf ("saving data on %s\n", restart_name);
    if (! (restart_file = fopen (restart_name, "w"))) {
      perror (restart_name);
      exit (5);
    }
    else {
      fprintf (restart_file, "%d\n", n);
      fclose (restart_file);
    }
    if (! finished)
      printf ("CONTINUE\n");
    else
      printf ("FINISHED\n");
  }
  exit (0);
}

Here is how this job is run. First I submit it with the environment variable RSAVE_RESTART unset, which initialises the job. Then I set RSAVE_RESTART to yes and resubmit the job, which restarts from where it left off.

The job is allowed to run no longer than 30 seconds at a time. Given the safety margin of 10 seconds and a single iteration time of 5 seconds, this should let our program do 4 iterations. But after the third iteration 15 seconds have elapsed, so time_left is 15 and quit_time is 15 - 5 - 10 = 0. The while clause tests for quit_time > 0, not for quit_time >= 0, so in effect we end up with 3 iterations instead of 4.
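
The count can be checked with a small simulation of the loop. The function iterations_per_run below is a hypothetical sketch, not part of rts.c. It assumes that every iteration takes exactly loop_time seconds, and it ignores the finished flag.

```c
/* iterations_per_run: hypothetical helper, not part of rts.c.
 * Simulates the do ... while loop of rts.c for one run and returns
 * the number of iterations performed. Every iteration is assumed
 * to take exactly loop_time seconds. If strict is non-zero the loop
 * continues while quit_time > 0, as rts.c does; otherwise it
 * continues while quit_time >= 0.
 * Returns -1 if loop_time is not positive. */
int iterations_per_run(long time_limit, long loop_time,
                       long safety_margin, int strict)
{
    long elapsed = 0;
    long quit_time;
    int count = 0;

    if (loop_time <= 0)
        return -1;

    do {
        elapsed += loop_time;   /* stands for sleep (5) and n++ */
        count++;
        quit_time = (time_limit - elapsed) - loop_time - safety_margin;
    } while (strict ? (quit_time > 0) : (quit_time >= 0));

    return count;
}
```

With a limit of 30, a loop time of 5 and a margin of 10, the strict test yields 3 iterations, and the relaxed test yields 4.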

While the computational task remains unfinished, program rts writes CONTINUE on standard output before it exits. But the last run, when n becomes 31, is flagged with the word FINISHED.

[gustav@bh1 rts]$ make
co  RCS/Makefile,v Makefile
RCS/Makefile,v  -->  Makefile
revision 1.1
done
co  RCS/rts.c,v rts.c
RCS/rts.c,v  -->  rts.c
revision 1.1
done
cc -c rts.c
cc -o rts rts.o
[gustav@bh1 rts]$ env | grep RSAVE
RSAVE_CHECKFILE=rts.dat
RSAVE_TIME_LIMIT=30
[gustav@bh1 rts]$ ./rts
Time for this job limited to 30 seconds.
Starting a new run.
n = 0
        computing ... 
                n = 1, time left = 25 seconds
                n = 2, time left = 20 seconds
                n = 3, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 3
saving data on rts.dat
CONTINUE
[gustav@bh1 rts]$ export RSAVE_RESTART=yes
[gustav@bh1 rts]$ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 3
        computing ... 
                n = 4, time left = 25 seconds
                n = 5, time left = 20 seconds
                n = 6, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 6
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
[gustav@bh1 rts]$ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 6
        computing ... 
                n = 7, time left = 25 seconds
                n = 8, time left = 20 seconds
                n = 9, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 9
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
[gustav@bh1 rts]$ 

...

[gustav@bh1 rts]$ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 24
        computing ... 
                n = 25, time left = 25 seconds
                n = 26, time left = 20 seconds
                n = 27, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 27
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
[gustav@bh1 rts]$ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 27
        computing ... 
                n = 28, time left = 25 seconds
                n = 29, time left = 20 seconds
                n = 30, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 30
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
[gustav@bh1 rts]$ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 30
        computing ... 
                n = 31, time left = 25 seconds
        done.
n = 31
renaming old restart file to rts.dat.old
saving data on rts.dat
FINISHED
[gustav@bh1 rts]$


Zdzislaw Meglicki
2004-04-29