
Restoring and Saving in C

The following listing shows a very simple C program that, if requested, reads the state of a computation from a file. If not requested, it initialises a new computation. It then performs some further computation and saves the new state to a file.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>   /* declares sleep */

int main (void)
{
  char *restart_name, *restart, old_restart_name[BUFSIZ];
  FILE *restart_file;
  int n;

  /* Is this a continued job or a new one? */

  if (! (restart = getenv ("RSAVE_RESTART"))) {
    printf ("Starting a new run.\n");
    n = 0;
  }
  else {
    if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
      fprintf (stderr, "error: no checkpoint file for the restart job\n");
      exit (1);
    }
    else {
      printf ("Restarting the job from %s.\n", restart_name);
      if (! (restart_file = fopen(restart_name, "r"))) {
        perror (restart_name);
        exit (2);
      }
      else {
        if (! (fscanf (restart_file, "%d", &n) > 0)) {
          fprintf (stderr, "%s: input file format error\n", restart_name);
          exit (3);
        }
        else {
          fclose (restart_file);
        }
      }          
    }
  }

  printf ("n = %d\n", n);
  printf ("\tcomputing ... "); fflush (stdout);
  sleep (5);
  n++;
  printf ("done.\n");
  printf ("n = %d\n", n);

  if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
    printf ("checkpointing not requested, exiting...\n");
    exit (0);
  }
  else {
    if (restart) {
      /* Build "<name>.old"; snprintf never writes past the end of the buffer. */
      snprintf (old_restart_name, sizeof old_restart_name, "%s.old", restart_name);
      printf ("renaming old restart file to %s\n", old_restart_name);
      if (0 > rename (restart_name, old_restart_name)) {
        perror (old_restart_name);
        exit (4);
      }
    }
    printf ("saving data on %s\n", restart_name);
    if (! (restart_file = fopen (restart_name, "w"))) {
      perror (restart_name);
      exit (5);
    }
    else {
      fprintf (restart_file, "%d\n", n);
      fclose (restart_file);
    }
  }
  exit (0);
}
I'll explain how this program works in detail below, but first let's just see what it does:
gustav@sp19:../LoadLeveler 13:39:41 !516 $ env | grep RSAVE
RSAVE_CHECKFILE=rsave.dat
RSAVE_RESTART=yes
gustav@sp19:../LoadLeveler 13:39:45 !517 $ unset RSAVE_RESTART
gustav@sp19:../LoadLeveler 13:39:51 !518 $ ./rsave
Starting a new run.
n = 0
        computing ... done.
n = 1
saving data on rsave.dat
gustav@sp19:../LoadLeveler 13:40:03 !519 $ export RSAVE_RESTART="yes"
gustav@sp19:../LoadLeveler 13:40:14 !520 $ ./rsave
Restarting the job from rsave.dat.
n = 1
        computing ... done.
n = 2
renaming old restart file to rsave.dat.old
saving data on rsave.dat
gustav@sp19:../LoadLeveler 13:40:22 !521 $ cat rsave.dat
2
gustav@sp19:../LoadLeveler 13:40:29 !522 $ ./rsave
Restarting the job from rsave.dat.
n = 2
        computing ... done.
n = 3
renaming old restart file to rsave.dat.old
saving data on rsave.dat
gustav@sp19:../LoadLeveler 13:40:51 !523 $ ./rsave
Restarting the job from rsave.dat.
n = 3
        computing ... done.
n = 4
renaming old restart file to rsave.dat.old
saving data on rsave.dat
gustav@sp19:../LoadLeveler 13:41:28 !524 $

Here is the promised explanation of the program in detail.

The first thing the program does is check for the existence of the environment variable RSAVE_RESTART. If the variable does not exist, the program starts a new run and initialises n to 0.

If the variable RSAVE_RESTART exists (its value doesn't really matter), then we first check whether another variable, RSAVE_CHECKFILE, which should specify the name of the checkpoint file, exists too. If it doesn't, we have no way of finding the name of the checkpoint file, so we print an error message and exit with an error status (value 1).
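The decision just described can be isolated in a small function. The following is only a sketch: the names start_mode, current_start_mode and the enum constants are made up for illustration and do not appear in the listing.

```c
#include <stdlib.h>

/* Hypothetical names, not part of the listing. */
enum start_mode { START_NEW, START_RESTART, START_NO_CHECKFILE };

/* Classify the start-up from the two environment values.
   Only the presence of restart matters, not its contents. */
enum start_mode start_mode (const char *restart, const char *checkfile)
{
  if (restart == NULL)
    return START_NEW;               /* RSAVE_RESTART not set: new run */
  if (checkfile == NULL)
    return START_NO_CHECKFILE;      /* restart requested, but no file name */
  return START_RESTART;
}

/* Convenience wrapper that looks the two variables up in the environment. */
enum start_mode current_start_mode (void)
{
  return start_mode (getenv ("RSAVE_RESTART"), getenv ("RSAVE_CHECKFILE"));
}
```

Passing the two values as arguments keeps the decision itself free of any dependence on the environment, so it can be exercised without setting or unsetting variables.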

If the variable RSAVE_CHECKFILE exists then we use its value as the name of the checkpoint file, print a message about restarting the job from that file, and attempt to open it for reading.

If for some reason the file cannot be opened, we print a diagnostic on standard error (with perror), flag an error (value 2) and exit.

If the file has been opened without problems we try to read an integer number from it. That integer is the whole object of our simple computation in this program and it represents the state of the system.

It may happen that for some reason the checkpoint file does not contain that integer. In that case we print the corresponding error message, flag an error (value 3) and exit.
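The open-and-read step can also be written as a function that returns a status instead of exiting. This is a sketch only: read_state is a made-up name, and its return values simply mirror the exit values 2 and 3 used above.

```c
#include <stdio.h>

/* Hypothetical helper, not part of the listing.
   Reads the single integer that makes up the state from path into *n.
   Returns 0 on success, 2 if the file cannot be opened, and 3 if it
   does not begin with an integer. */
int read_state (const char *path, int *n)
{
  FILE *fp = fopen (path, "r");
  int ok;

  if (fp == NULL)
    return 2;
  ok = (fscanf (fp, "%d", n) == 1); /* exactly one item must be converted */
  fclose (fp);
  return ok ? 0 : 3;
}
```

An empty file is reported as a format error too, because fscanf then returns EOF rather than 1.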

But if everything goes well, by this time we should have our state of the system in hand, so we close the checkpoint file (in case of an error exit the file would be closed automatically) and commence the computation.

The computation is quite trivial. We simply increment the integer read from the file by 1. In order to add a little more body to the program we also sleep for 5 seconds (this is called putting on weight). We will need that sleep in our next example, which will combine timing with saving and restoring.

Once the computation is finished, we check the environment variable RSAVE_CHECKFILE again. Observe that the branch of the program that initialises a new run has not looked this variable up so far. That is why we look it up again here, even though the branch responsible for restarting the job has already done so.

If the variable RSAVE_CHECKFILE is not defined, we print the message "checkpointing not requested, exiting..." and exit. No error condition is flagged this time: the exit status is 0.

If the variable RSAVE_CHECKFILE exists and the job is a restarted one, we attempt to rename the original restart file by appending the suffix .old to its name.

If for some reason that cannot be done, we print a diagnostic on standard error using perror, flag an error (value 4) and exit.
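The backup name has to fit in a fixed-size buffer (old_restart_name, BUFSIZ bytes in the listing). Here is a minimal sketch of the rename step with an explicit length check. The function name backup_checkpoint is an assumption made for this illustration.

```c
#include <stdio.h>

/* Hypothetical helper, not part of the listing.
   Renames name to "<name>.old".
   Returns 0 on success and non-zero on failure. */
int backup_checkpoint (const char *name)
{
  char old_name[BUFSIZ];
  int len = snprintf (old_name, sizeof old_name, "%s.old", name);

  if (len < 0 || (size_t) len >= sizeof old_name)
    return -1;                      /* name too long for the buffer */
  return rename (name, old_name);   /* 0 on success, non-zero otherwise */
}
```

The return value of snprintf is the length the full string would have had, so comparing it with the buffer size detects truncation before any file is touched.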

Next, whether or not a rename took place, we attempt to open, this time for writing, a new file bearing the original name. If that cannot be done, a diagnostic is printed on standard error with perror, an error exit is flagged (value 5) and the program aborts.

Otherwise, i.e., if all went well and we have the new restart file open, we write the new value of n to it, close it, and exit with status 0.
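The saving step can be sketched as a function in the same way. Note that this version goes a little beyond the listing: it also checks the results of fprintf and fclose, since a full disk may only show up when the buffered data is flushed. The name write_state is hypothetical, and the return value 5 merely mirrors the exit value used for this stage.

```c
#include <stdio.h>

/* Hypothetical helper, not part of the listing.
   Saves the state n to path.
   Returns 0 on success and 5 on any failure. */
int write_state (const char *path, int n)
{
  FILE *fp = fopen (path, "w");
  int failed;

  if (fp == NULL)
    return 5;
  failed = (fprintf (fp, "%d\n", n) < 0);
  if (fclose (fp) != 0)             /* buffered data is flushed here */
    failed = 1;
  return failed ? 5 : 0;
}
```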

This is really quite simple stuff. Whatever complexity there is in the example derives from my attempt to make the program robust. Regardless of whether the variables RSAVE_RESTART and RSAVE_CHECKFILE exist, and regardless of whether the data file itself exists, the program should always do something more or less sensible, write meaningful error messages if need be, and exit gracefully, conveying a meaningful exit value to the shell. For seasoned C and C++ programmers all that is just bread and butter.
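The exit values used by the program can be collected in one place. The following sketch gives them names. The enum and the function rsave_status_text are assumptions made for illustration, while the numeric values are the ones described above.

```c
#include <stdio.h>

/* Hypothetical names for the exit values used by the program. */
enum rsave_status {
  RSAVE_OK           = 0,  /* normal exit, with or without checkpointing */
  RSAVE_NO_CHECKFILE = 1,  /* restart requested, RSAVE_CHECKFILE not set */
  RSAVE_OPEN_READ    = 2,  /* checkpoint file could not be opened */
  RSAVE_BAD_FORMAT   = 3,  /* checkpoint file does not hold an integer */
  RSAVE_RENAME       = 4,  /* old checkpoint file could not be renamed */
  RSAVE_OPEN_WRITE   = 5   /* new checkpoint file could not be opened */
};

/* Returns a short description of an exit value, e.g. for a log message. */
const char *rsave_status_text (int status)
{
  switch (status) {
  case RSAVE_OK:           return "success";
  case RSAVE_NO_CHECKFILE: return "no checkpoint file name";
  case RSAVE_OPEN_READ:    return "cannot open checkpoint file for reading";
  case RSAVE_BAD_FORMAT:   return "checkpoint file format error";
  case RSAVE_RENAME:       return "cannot rename old checkpoint file";
  case RSAVE_OPEN_WRITE:   return "cannot open checkpoint file for writing";
  default:                 return "unknown status";
  }
}
```

A job script that inspects the exit value in the shell can then be checked against a single table rather than against exit calls scattered through the code.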


Zdzislaw Meglicki
2001-02-26