
Restoring and Saving the State of a Job

In this section we shall discuss how to save and then restore the state of a computation between successive invocations of a program via PBS. The basic idea is that information can be transferred between successive invocations of a program in only one of three ways:

1. through a file, or
2. through an environmental variable, or
3. through command line switches.

Transferring data through a file is perhaps the most common practice. Using files you can transfer very large amounts of data: e.g., the whole state of a 3D flow, or the whole state of a protein, or the whole state of a car in a crash simulation. Basically, files can be used to transfer any information from one instantiation of a program to another, including even small items of information, such as whether the program should restart a computation from a previously reached state, or whether it should start a new computation.
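To make the idea of a larger state concrete, here is a minimal sketch of saving and reloading an array of doubles in binary form. The function names save_state and load_state and the file layout (an element count followed by the raw values) are assumptions made for this illustration; they are not part of the rsave example discussed below.

```c
#include <stdio.h>
#include <stdlib.h>

/* Write count doubles from state to path.
 * Returns 0 on success, -1 on any error. */
int save_state(const char *path, const double *state, size_t count)
{
    FILE *f = fopen(path, "wb");
    if (!f)
        return -1;
    /* Store the element count first so the reader knows how much to expect. */
    int ok = fwrite(&count, sizeof count, 1, f) == 1
          && fwrite(state, sizeof *state, count, f) == count;
    /* fclose flushes buffered data, so its result matters too. */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Read a state written by save_state. On success returns a malloc'ed
 * array (the caller frees it) and stores its length in *count.
 * Returns NULL on any error. */
double *load_state(const char *path, size_t *count)
{
    FILE *f = fopen(path, "rb");
    double *state = NULL;
    if (!f)
        return NULL;
    if (fread(count, sizeof *count, 1, f) == 1) {
        /* Allocate at least one element so that an empty state
         * is not mistaken for a failure. */
        state = malloc((*count ? *count : 1) * sizeof *state);
        if (state && fread(state, sizeof *state, *count, f) != *count) {
            free(state);
            state = NULL;
        }
    }
    fclose(f);
    return state;
}
```

A file written this way is tied to the byte order and type sizes of the machine that produced it, which is usually acceptable for a checkpoint that is read back on the same cluster.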

Instead of writing to files, information can, in principle, also be passed through the user's environment. On its next invocation the program can check for the existence and state of certain predefined environmental variables and obtain the required information that way. This method is good for transferring small amounts of information, e.g., the name of a checkpoint file or a request to initialize a run, but not for very large data sets.

Of course, using environmental variables will not work if the variables themselves are not transmitted from one PBS process to the next. Because we cannot submit jobs directly from computational nodes on the AVIDD cluster and have to resort to the ssh kludge instead, transmitting the whole environment would be rather difficult. In section 4.3.4 we used the -v option of qsub to transmit and set just one environmental variable, and this did the trick there. We are going to resort to a similar technique here.
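It may help to recall why the variables have to be transmitted explicitly. On a POSIX system a process can change only its own environment; its children inherit a copy, but the shell that started it is unaffected. The sketch below illustrates the downward direction only. The function name run_with_checkfile is made up for this example, and setenv is a POSIX call rather than ISO C.

```c
#define _POSIX_C_SOURCE 200112L   /* asks <stdlib.h> to declare setenv */
#include <stdlib.h>

/* Run command through the shell with RSAVE_CHECKFILE set to checkfile.
 * Returns whatever system() returns, or -1 if the variable
 * could not be set. */
int run_with_checkfile(const char *checkfile, const char *command)
{
    /* setenv changes the environment of this process only ... */
    if (setenv("RSAVE_CHECKFILE", checkfile, 1) != 0)
        return -1;
    /* ... but the shell started by system() is a child of this
     * process, so it inherits the variable. */
    return system(command);
}
```

For instance, calling run_with_checkfile("rsave.dat", "./rsave") would start rsave with the variable already in place, whereas the shell from which the caller was launched would still not see it.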

We are going to write the name of the checkpoint file, and whether the job should be continued, at the end of a log file. After our application exits, the PBS script responsible for running it can inspect the log, and if it finds the instruction that the job should be continued, it can resubmit itself, passing appropriate values of certain environmental variables to its next instantiation by means of the -v option. When the script is reincarnated by PBS, our application will begin by checking for the presence of these variables in the environment and for their content. From there it will learn whether it should continue or reinitialize the computation and, if it should continue, where to look for the state reached by the previous run.
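The application side of this hand-off could look roughly like the sketch below. The marker text RSAVE_CONTINUE, the line layout, and both function names are invented for this illustration; the notes do not prescribe a log format at this point. The parser stands in for the inspection that the PBS script itself would carry out.

```c
#include <stdio.h>
#include <string.h>

#define CONTINUE_TAG "RSAVE_CONTINUE"   /* hypothetical marker text */

/* Append a line of the form "RSAVE_CONTINUE <checkfile>" to the log.
 * Returns 0 on success, -1 on a write error. */
int log_continue_request(FILE *log, const char *checkfile)
{
    if (fprintf(log, "%s %s\n", CONTINUE_TAG, checkfile) < 0)
        return -1;
    return fflush(log) == 0 ? 0 : -1;
}

/* If line is a continuation marker, copy the file name into checkfile
 * (a buffer of size bytes) and return 1; otherwise return 0. */
int parse_continue_line(const char *line, char *checkfile, size_t size)
{
    size_t tag_len = strlen(CONTINUE_TAG);
    const char *name;
    size_t name_len;

    if (strncmp(line, CONTINUE_TAG, tag_len) != 0 || line[tag_len] != ' ')
        return 0;
    name = line + tag_len + 1;
    name_len = strcspn(name, "\n");     /* stop at the end of the line */
    if (name_len == 0 || name_len >= size)
        return 0;                       /* empty or too long for the buffer */
    memcpy(checkfile, name, name_len);
    checkfile[name_len] = '\0';
    return 1;
}
```

Whatever reads the log only needs to apply the test to the last line: if it is a marker, the job is resubmitted with the extracted file name, and otherwise the chain of jobs ends.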

The following listing shows a very simple C-language program which, if requested, reads the state of a computation from a file. If not requested, it initializes a new computation. Then some further computation is performed and the new state is again saved to a file.

/*
 * Obtain information about the status of the computation and the name of the
 * checkpoint file (if present) from the environment. Open the checkpoint
 * file if the job is to be continued and read the state of the computation
 * from it. 
 *
 * Sleep for 5 seconds (to make the computation look more substantial)
 * and perform the computation itself. 
 *
 * If checkpointing has been requested then if the job has been restarted,
 * rename the old checkpoint file, then write the state of the computation
 * on a new checkpoint file. 
 *
 * %Id: rsave.c,v 1.1 2003/09/19 19:24:08 gustav Exp %
 * %Log: rsave.c,v %
 * Revision 1.1  2003/09/19 19:24:08  gustav
 * Initial revision
 *
 *
 */

#include <stdio.h>  /* has definitions of printf, fprintf, fopen, fscanf,
                       fclose, fflush, perror, rename and BUFSIZ */
#include <stdlib.h> /* has definitions of getenv and exit */
#include <unistd.h> /* has definition of sleep */
#include <string.h> /* has definitions of strcpy and strcat */

int main(void)
{
  char *restart_name, *restart, old_restart_name[BUFSIZ];
  FILE *restart_file;
  int n;

  /* Is this a continued job or a new one? */

  if (! (restart = getenv ("RSAVE_RESTART"))) {
    printf ("Starting a new run.\n");
    n = 0;
  }
  else {
    if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
      fprintf (stderr, "error: no checkpoint file for the restart job\n");
      exit (1);
    }
    else {
      printf ("Restarting the job from %s.\n", restart_name);
      if (! (restart_file = fopen(restart_name, "r"))) {
        perror (restart_name);
        exit (2);
      }
      else {
        if (! (fscanf (restart_file, "%d", &n) > 0)) {
          fprintf (stderr, "%s: input file format error\n", restart_name);
          exit (3);
        }
        else {
          fclose (restart_file);
        }
      }          
    }
  }

  printf ("n = %d\n", n);
  printf ("\tcomputing ... "); fflush (stdout);
  sleep (5);
  n++;
  printf ("done.\n");
  printf ("n = %d\n", n);

  if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
    printf ("checkpointing not requested, exiting...\n");
    exit (0);
  }
  else {
    if (restart) {
      strcpy (old_restart_name, restart_name);
      strcat (old_restart_name, ".old");
      printf ("renaming old restart file to %s\n", old_restart_name);
      if (0 > rename (restart_name, old_restart_name)) {
        perror (old_restart_name);
        exit (4);
      }
    }
    printf ("saving data on %s\n", restart_name);
    if (! (restart_file = fopen (restart_name, "w"))) {
      perror (restart_name);
      exit (5);
    }
    else {
      fprintf (restart_file, "%d\n", n);
      fclose (restart_file);
    }
  }
  exit (0);
}
I'll explain how this program works in detail below, but first let's just see what it does:
[gustav@bh1 rsave]$ make
co  RCS/Makefile,v Makefile
RCS/Makefile,v  -->  Makefile
revision 1.1
done
co  RCS/rsave.c,v rsave.c
RCS/rsave.c,v  -->  rsave.c
revision 1.1
done
cc -c rsave.c
cc -o rsave rsave.o
[gustav@bh1 rsave]$ env | grep RSAVE
[gustav@bh1 rsave]$ export RSAVE_CHECKFILE=rsave.dat
[gustav@bh1 rsave]$ ./rsave
Starting a new run.
n = 0
        computing ... done.
n = 1
saving data on rsave.dat
[gustav@bh1 rsave]$ export RSAVE_RESTART=yes
[gustav@bh1 rsave]$ ./rsave
Restarting the job from rsave.dat.
n = 1
        computing ... done.
n = 2
renaming old restart file to rsave.dat.old
saving data on rsave.dat
[gustav@bh1 rsave]$ cat rsave.dat
2
[gustav@bh1 rsave]$ ./rsave
Restarting the job from rsave.dat.
n = 2
        computing ... done.
n = 3
renaming old restart file to rsave.dat.old
saving data on rsave.dat
[gustav@bh1 rsave]$ ./rsave
Restarting the job from rsave.dat.
n = 3
        computing ... done.
n = 4
renaming old restart file to rsave.dat.old
saving data on rsave.dat
[gustav@bh1 rsave]$
Here is the promised explanation of the program in detail.

The first thing the program does is to check for the existence of the environmental variable RSAVE_RESTART. If the variable does not exist, the program starts a new run and initializes n to 0.

If the variable RSAVE_RESTART exists (it doesn't really matter what its value is) then we first check if another variable, RSAVE_CHECKFILE, which should specify the name of the checkpoint file, exists too. If it doesn't, then we have no way to find the name of the checkpoint file. So in that case we print an error message and exit, flagging an error with the exit value 1.
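The decision made at startup can be condensed into a single function, sketched below. The enum and the name decide_run_mode are made up for this illustration; the listing itself does the same thing with nested if statements.

```c
#include <stdlib.h>

enum run_mode { RUN_NEW, RUN_RESTART, RUN_ERROR };

/* Decide how to start. The checkpoint file name, or NULL if
 * RSAVE_CHECKFILE is not set, is stored in *checkfile. */
enum run_mode decide_run_mode(const char **checkfile)
{
    *checkfile = getenv("RSAVE_CHECKFILE");
    /* Only the presence of RSAVE_RESTART matters; its value,
     * even an empty string, is never examined. */
    if (getenv("RSAVE_RESTART") == NULL)
        return RUN_NEW;
    /* A restart without a checkpoint file name cannot proceed. */
    return *checkfile ? RUN_RESTART : RUN_ERROR;
}
```

Because getenv returns a null pointer only when the variable is absent, a setting such as RSAVE_RESTART= (an empty value) still selects a restart.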

If the variable RSAVE_CHECKFILE exists then we use its value as the name of the checkpoint file, print a message about restarting the job from that file and attempt to open it for reading.

If for some reason the file cannot be opened, we print a diagnostic on standard error (with perror), flag an error (value 2) and exit.

If the file has been opened without problems we try to read an integer number from it. That integer is the whole object of our simple computation in this program and it represents the state of the computation.

It may happen that for some reason the checkpoint file does not contain that integer. In that case we print the corresponding error message, flag an error (value 3) and exit.

But if everything goes well, by this time we should have our state of the system in hand, so we close the checkpoint file (in case of an error exit the file would be closed automatically) and commence the computation.
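The read step relies on the return value of fscanf, which is worth spelling out. The helper below is a small sketch; its name read_state is an assumption made for this example. It tests for exactly one converted item, which for a single %d has the same effect as the "> 0" test in the listing.

```c
#include <stdio.h>

/* Read one integer from f into *n.
 * Returns 0 on success, -1 if no integer could be read. */
int read_state(FILE *f, int *n)
{
    /* fscanf returns the number of items it converted: 1 on success,
     * 0 if the text is not a number, and EOF (a negative value)
     * if the input ends before any conversion. */
    return fscanf(f, "%d", n) == 1 ? 0 : -1;
}
```

Both failure cases, an empty file and a file containing something other than a number, therefore end up in the branch that exits with value 3.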

The computation is quite trivial. We simply increment the integer read from the file by 1. In order to add a little more body to the program we also sleep for 5 seconds (this is called putting on weight). We will need that sleep in our next example, which will combine timing with saving and restoring.

Once the computation is finished we check the environmental variable RSAVE_CHECKFILE again. Observe that the branch of the program that does the initialization has not looked this variable up so far. That is why we look it up here again, even though the branch responsible for restarting the job has done so already.

If the variable RSAVE_CHECKFILE is not defined, we print the message "checkpointing not requested, exiting..." and exit. No error condition is flagged this time.

If the variable RSAVE_CHECKFILE exists, and if the job is a restarted one, then we attempt to rename the original restart file by appending the suffix .old to its name.

If for some reason that cannot be done, we print a diagnostic on standard error using perror, flag an error (value 4) and exit.

Otherwise, having renamed the old restart file, we attempt to open, this time for writing, a new file bearing the old name. If for some reason that cannot be done a diagnostic is printed on standard error with perror, an error exit is flagged (value 5) and the program aborts.

Otherwise, i.e., if all went well and we have the new restart file opened, we write the new value of n on it, close it, and exit with status 0.
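One detail of the renaming step deserves a remark. The listing builds the backup name with strcpy and strcat in a buffer of BUFSIZ characters, which quietly assumes that the value of RSAVE_CHECKFILE is short enough to fit together with the suffix. A bounded alternative is sketched below; the helper name make_backup_name is made up for this example, and snprintf requires a C99 library.

```c
#include <stdio.h>

/* Build "<name>.old" in out, a buffer of size bytes.
 * Returns 0 on success, -1 if the result does not fit. */
int make_backup_name(const char *name, char *out, size_t size)
{
    /* snprintf never writes more than size bytes and returns the
     * length the full string would have had. */
    int len = snprintf(out, size, "%s.old", name);
    return (len >= 0 && (size_t)len < size) ? 0 : -1;
}
```

With such a helper the program could report an overlong file name as an error of its own instead of overrunning the buffer.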

This is really quite simple stuff. Whatever complexity there is in the presented example, it derives from my attempt to make the program robust. Regardless of whether variables RSAVE_RESTART and RSAVE_CHECKFILE exist, regardless of whether the data file itself exists, the program should always do something more or less sensible, write meaningful error messages if need be, and exit gracefully conveying a meaningful exit value to the shell. For seasoned C and C++ programmers all this is just bread and butter.
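The exit values used by the program are summarized in the sketch below as a lookup function, which may be handy for whatever inspects the exit status afterwards. The function name rsave_status_text and the wording of the messages are assumptions made for this summary; the numbers and their meanings are taken from the listing.

```c
/* Translate an exit status of rsave into a short description. */
const char *rsave_status_text(int status)
{
    switch (status) {
    case 0:  return "success";
    case 1:  return "restart requested but RSAVE_CHECKFILE is not set";
    case 2:  return "checkpoint file could not be opened for reading";
    case 3:  return "checkpoint file does not contain an integer";
    case 4:  return "old checkpoint file could not be renamed";
    case 5:  return "new checkpoint file could not be opened for writing";
    default: return "unknown status";
    }
}
```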


Zdzislaw Meglicki
2004-04-29