
The Complete Application in C

Here is the C version of the program. The wall-clock time is measured using the UNIX function time. The parameters LAST_N and SAFETY_MARGIN are implemented as cpp (C preprocessor) constants. I could also read them from the environment, the command line, or an input file, but that would clutter the example.
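
As an aside on that last point, reading such a parameter from the environment need not take much code. The following is only a sketch of one way to do it, not part of rts: the helper env_int_or_default and the variable name RSAVE_LAST_N used in the note after it are hypothetical names introduced for illustration.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Hypothetical helper, not part of rts: return the integer value of the
   environment variable `name`, or `fallback` when the variable is unset,
   empty, out of int range, or not a clean decimal integer. */
int env_int_or_default (const char *name, int fallback)
{
  const char *text = getenv (name);
  char *end;
  long value;

  if (text == NULL || *text == '\0') return fallback;

  errno = 0;
  value = strtol (text, &end, 10);

  /* Reject overflow, trailing garbage, and values that do not fit an int */
  if (errno != 0 || *end != '\0' || value < INT_MIN || value > INT_MAX)
    return fallback;

  return (int) value;
}
```

With a helper like this, the loop bound could be obtained as env_int_or_default ("RSAVE_LAST_N", LAST_N), so that the cpp constant remains the default when the variable is not set.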

The program always executes the statements of the do ... while loop at least once, because the exit condition is tested at the end of the loop.

Observe that the variable quit_time is initialised to 1l. That way, if timing is not requested, it always remains positive and the while test ends the loop only when the job is finished. Furthermore, the variable timing is initialised to TRUE and becomes FALSE only if the environment variable RSAVE_TIME_LIMIT is not set. The job is assumed to be unfinished on entry (the variable finished is initialised to FALSE) and becomes finished only when n becomes greater than LAST_N. This means that even after n has exceeded LAST_N you can still submit the job, and each such run will increment n by 1 before exiting.
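
The control flow described above can be isolated in a few lines. This is a minimal sketch, assuming timing is switched off, so that quit_time keeps its initial value; the function name run_untimed is hypothetical and the work done inside the loop is reduced to incrementing n.

```c
/* Hypothetical reduction of the rts loop with timing disabled.
   quit_time is never updated, so only `finished` can end the loop. */
int run_untimed (int n, int last_n)
{
  long quit_time = 1L;   /* stays positive for the whole run */
  int finished = 0;      /* the job is assumed unfinished on entry */

  do {
    n++;                              /* body runs at least once */
    if (n > last_n) finished = 1;
  } while ((quit_time > 0) && (! finished));

  return n;
}
```

Called with n = 0 and last_n = 30, this returns 31. Called again with n = 31 it returns 32, which is the extra increment mentioned above: the body of a do ... while loop runs once even when the job was already complete on entry.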

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>   /* sleep */

#ifndef TRUE
# define TRUE 1
#endif
#ifndef FALSE
# define FALSE 0
#endif

#ifndef LAST_N
# define LAST_N 30
#endif

#ifndef SAFETY_MARGIN
# define SAFETY_MARGIN 10
#endif

int main(void)
{
  char *restart_name, *restart, old_restart_name[BUFSIZ];
  FILE *restart_file;
  int n, finished = FALSE, timing = TRUE;
  time_t t0, t1, t2, loop_time, time_left, time_limit, quit_time = 1l;
  char *time_limit_string;

  /* Check the clock at the beginning of the run */
  t0 = time(NULL);

  /* Check how much time we have for this job */
  if (! (time_limit_string = getenv ("RSAVE_TIME_LIMIT"))) {
    printf ("Unlimited time for this job.\n");
    timing = FALSE;
  }
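  else {
    long limit_seconds;  /* time_t need not be an int, so parse into a long */

    if (! (0 < sscanf (time_limit_string, "%ld", &limit_seconds))) {
      fprintf (stderr, "Error: bad format of RSAVE_TIME_LIMIT\n");
      exit (1);
    }
    else {
      time_limit = (time_t) limit_seconds;
      printf ("Time for this job limited to %ld seconds.\n", limit_seconds);
    }
  }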
    
  /* Is this a continued job or a new one? */

  if (! (restart = getenv ("RSAVE_RESTART"))) {
    printf ("Starting a new run.\n");
    n = 0;
  }
  else {
    if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
      fprintf (stderr, "error: no checkpoint file for the restart job\n");
      exit (1);
    }
    else {
      printf ("Restarting the job from %s.\n", restart_name);
      if (! (restart_file = fopen(restart_name, "r"))) {
        perror (restart_name);
        exit (2);
      }
      else {
        if (! (fscanf (restart_file, "%d", &n) > 0)) {
          fprintf (stderr, "%s: input file format error\n", restart_name);
          exit (3);
        }
        else {
          fclose (restart_file);
        }
      }          
    }
  }

  printf ("n = %d\n", n);
  printf ("\tcomputing ... \n"); fflush (stdout);

  /* Loop while keeping an eye on the clock */

  do {
    if (timing) t1 = time(NULL);

    sleep (5);
    n++;

    /* Check if the whole simulation has been finished: 
       this is our ``convergence'' criterion. 
       */
    if (n > LAST_N) finished = TRUE;

    /* Check if we still have enough time for the next loop.
       */
    if (timing) {
       t2 = time(NULL);
       loop_time = t2 - t1;
       time_left = time_limit - (t2 - t0);
       quit_time = time_left - loop_time - SAFETY_MARGIN;
       printf ("\t\tn = %d, time left = %ld seconds\n", n, (long) time_left);
       if ((quit_time <= 0) && (! finished))
         printf ("\t\tRun out of time, exiting ... \n");    
    }
  } while ((quit_time > 0) && (! finished));

  printf ("\tdone.\n");
  printf ("n = %d\n", n);

  if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
    printf ("checkpointing not requested, exiting...\n");
    exit (0);
  }
  else {
    if (restart) {
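      /* Make sure the name plus ".old" fits into old_restart_name */
      if (strlen (restart_name) + sizeof ".old" > sizeof old_restart_name) {
        fprintf (stderr, "%s: checkpoint file name too long\n", restart_name);
        exit (4);
      }
      strcpy (old_restart_name, restart_name);
      strcat (old_restart_name, ".old");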
      printf ("renaming old restart file to %s\n", old_restart_name);
      if (0 > rename (restart_name, old_restart_name)) {
        perror (old_restart_name);
        exit (4);
      }
    }
    printf ("saving data on %s\n", restart_name);
    if (! (restart_file = fopen (restart_name, "w"))) {
      perror (restart_name);
      exit (5);
    }
    else {
      fprintf (restart_file, "%d\n", n);
      fclose (restart_file);
    }
    if (! finished)
      printf ("CONTINUE\n");
    else
      printf ("FINISHED\n");
  }
  exit (0);
}

Here is how this job is run. First I submit it with the environment variable RSAVE_RESTART unset, which initialises the job. Then I set RSAVE_RESTART to yes and resubmit the job, which restarts from where it left off.

The job is allowed to run no longer than 30 seconds at a time. Given the safety margin of 10 seconds and a single iteration time of 5 seconds, this should let our program do 4 iterations. But after the third iteration time_left is 15 seconds, so quit_time = 15 - 5 - 10 = 0, and the while clause tests for quit_time > 0, not for quit_time >= 0. In effect we end up with 3 iterations instead of 4.
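
The arithmetic can be checked without waiting for real time to pass. The sketch below replaces the clock with a counter and assumes that every iteration takes exactly `step` seconds (with step > 0) and that the job does not finish in the meantime. The function name count_iterations and the `strict` flag are hypothetical, introduced only to compare the two forms of the test.

```c
/* Hypothetical simulation of the rts timing test with a fake clock.
   time_limit : seconds allowed for the run (30 in the transcript)
   step       : seconds taken by one iteration (5), must be positive
   margin     : the safety margin (10)
   strict     : nonzero tests quit_time > 0, as rts does;
                zero tests quit_time >= 0 instead */
int count_iterations (long time_limit, long step, long margin, int strict)
{
  long elapsed = 0, loop_time, time_left, quit_time;
  int iterations = 0;

  do {
    elapsed += step;                    /* stands in for sleep (5) */
    iterations++;
    loop_time = step;                   /* t2 - t1 in rts */
    time_left = time_limit - elapsed;   /* time_limit - (t2 - t0) in rts */
    quit_time = time_left - loop_time - margin;
  } while (strict ? (quit_time > 0) : (quit_time >= 0));

  return iterations;
}
```

For time_limit = 30, step = 5 and margin = 10, the strict test yields 3 iterations, because quit_time takes the values 10, 5 and 0. The non-strict test yields 4, with quit_time dropping to -5 after the fourth iteration.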

While the computational task remains unfinished, the program rts writes CONTINUE on standard output before it exits. But the last run, in which n becomes 31, is flagged with the word FINISHED.
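
Whatever resubmits the job can key off that final word. The following is a sketch only, assuming that the output of rts has been captured in a stream and that the flag is the last non-empty line, as it is in the transcript below. The function name needs_resubmit is hypothetical and is not part of rts.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: scan the captured output of one rts run.
   Returns 1 if the last non-empty line is CONTINUE, 0 if it is
   FINISHED, and -1 if neither flag was found (for example when rts
   exited with an error or checkpointing was not requested). */
int needs_resubmit (FILE *out)
{
  char line[BUFSIZ], last[BUFSIZ] = "";

  while (fgets (line, sizeof line, out)) {
    line[strcspn (line, "\r\n")] = '\0';       /* strip the line ending */
    if (line[0] != '\0') strcpy (last, line);  /* remember non-empty lines */
  }

  if (strcmp (last, "CONTINUE") == 0) return 1;
  if (strcmp (last, "FINISHED") == 0) return 0;
  return -1;
}
```

Treating a missing flag as a separate case keeps a failed run from being mistaken for a finished one.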

gustav@sp19:../LoadLeveler 14:33:15 !556 $ gcc -o rts rts.c
gustav@sp19:../LoadLeveler 14:33:25 !557 $ env | grep RSAVE
RSAVE_TIME_LIMIT=30
RSAVE_CHECKFILE=rts.dat
RSAVE_RESTART=yes
gustav@sp19:../LoadLeveler 14:33:29 !558 $ unset RSAVE_RESTART
gustav@sp19:../LoadLeveler 14:33:36 !559 $ ./rts
Time for this job limited to 30 seconds.
Starting a new run.
n = 0
        computing ... 
                n = 1, time left = 25 seconds
                n = 2, time left = 20 seconds
                n = 3, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 3
saving data on rts.dat
CONTINUE
gustav@sp19:../LoadLeveler 14:33:53 !560 $ export RSAVE_RESTART="yes"
gustav@sp19:../LoadLeveler 14:34:05 !561 $ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 3
        computing ... 
                n = 4, time left = 25 seconds
                n = 5, time left = 20 seconds
                n = 6, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 6
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
gustav@sp19:../LoadLeveler 14:34:25 !562 $ 

...

gustav@sp19:../LoadLeveler 14:38:19 !569 $ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 27
        computing ... 
                n = 28, time left = 25 seconds
                n = 29, time left = 20 seconds
                n = 30, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 30
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
gustav@sp19:../LoadLeveler 14:38:47 !570 $ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 30
        computing ... 
                n = 31, time left = 25 seconds
        done.
n = 31
renaming old restart file to rts.dat.old
saving data on rts.dat
FINISHED
gustav@sp19:../LoadLeveler 14:39:11 !571 $


Zdzislaw Meglicki
2001-02-26