How to Time, Save, and Resubmit your LoadLeveler Jobs

by
Zdzislaw Meglicki

18th December 1996 -- 2nd January 1997

Introduction

This article discusses timing, checkpointing, and resubmission of LoadLeveler jobs. LoadLeveler has a special mechanism for checkpointing jobs, but it requires linking your program with LoadLeveler libraries, it does not work for parallel jobs, and it is not available under other batch queueing systems, such as NQS. Rather than relying on it, I demonstrate here how you can easily implement your own timing, checkpointing, and job resubmission mechanism in C, Fortran 90, and Common Lisp.

The procedures discussed in this article are not limited to LoadLeveler. They should work for any batch submission system, as long as the batch jobs are described in terms of shell scripts, and as long as the system in question is IEEE-1003 (POSIX) compliant. They are applicable both to sequential and parallel jobs.

This article is addressed primarily to program developers who are, say, just out of school. Experienced developers know how to do these things, so they don't need to read this article, unless they wish to have a good laugh and blaze me off the net for clobbering it with junk. But if you haven't done such things before, and if you would like your programs to survive the inevitable demise of untimed queues at our facility, then read on.

If you use an off-the-shelf application, then this article is not for you either. You will have to resort to other techniques, which you may wish to discuss with us. What can and cannot be done in that context depends on the application in question and on your problem. Some well known applications, like Gaussian, provide their own elaborate checkpointing mechanisms.

There are four issues that need to be addressed when automatically checkpointing and resubmitting your LoadLeveler jobs.

  1. Timing the job:
    Your job must know how much CPU or wall-clock time it used so far, and how much time there is still left.
  2. Saving the state of the job:
    This usually involves dumping a data file which contains an essential summary of the state of the system that is being computed. That file will be read when the job is restarted, and computation will resume from the point reached when the file was dumped.
  3. Informing the parent process (usually a shell) that the computation should be continued:
    This can be done, for example, by exiting the job with a non-zero exit status. Alternatively you could write a specific message on a log file (to be searched for by the shell script when the job exits) or create an empty flag file.
  4. Resubmitting the job:
    Depending on whether the job should be continued, the LoadLeveler script, before exiting, should either
    1. resubmit itself, possibly with certain new flags or variables set up, or
    2. clean up and inform the user that the computation has been completed.


Timing a Job

Timing a Job in C

Probably most C-language programmers know how to time their jobs, because functions time and clock are parts of the standard C library, which is defined by ANSI C specifications.

Function time takes a pointer to time_t as an argument and returns a value of type time_t. On our system time_t is defined in /usr/include/sys/types.h and /usr/include/time.h as long. If the pointer is not NULL, the return value is also placed in whatever location the pointer points at. The returned value is the current calendar time, in seconds, since the Epoch, i.e., 00:00:00 GMT, 1st of January 1970: popularly celebrated as the day when UNIX was born.

UNIX was actually born in 1969, but the whole thing is a bit like Christmas. Nobody really knows when Jesus was born, but it's almost certain that it was not on Christmas Day. Christmas Day is really the old Roman Festival of Saturnalia (17th through 24th of December), which had been converted to a Christian holiday by Pope Sylvester I, who was Pope during the reign of emperor Constantine I (307-337). The first recorded celebration of Christmas took place in Rome in 336.

I am grateful for this information to Michael DiMaio from the Department of Philosophy, Salve Regina University, Newport, Rhode Island; and to Robert Heckendorn from Hewlett-Packard in Fort Collins, Colorado.

Anyhow, to sum up, you would use function time in order to find out about the elapsed wall-clock time. If you know that, say, a two_hour_dedicated queue allows only up to two wall-clock hours (7200 seconds) per job, by checking how much time you've used so far, you will know how much time there is still left too.

Function clock does not take any arguments and returns a value of type clock_t, which is defined in /usr/include/sys/types.h and /usr/include/time.h as int. This function returns the CPU time that has elapsed since the execution of the program commenced. The returned time is not in seconds. It is in clock ticks. There is a constant CLOCKS_PER_SEC defined in /usr/include/time.h, which tells how many clock ticks there are per second. So, in order to find out how many CPU seconds you have used so far you have to divide the result obtained by calling clock by CLOCKS_PER_SEC.

Because function clock returns clock_t, i.e., int (on AIX), it should be called frequently. Once the returned value reaches the largest value an int can hold (MAXINT in /usr/include/values.h, INT_MAX in the standard <limits.h>), the clock resets itself and resumes counting from 0.

The following example illustrates how to use functions time and clock.

#include <sys/types.h>
#include <time.h>
#include <unistd.h>
#include <stdio.h>
#include <math.h>   /* declares sqrt */

int main(void)
{
  time_t  t0, t1; /* time_t is defined on <time.h> and <sys/types.h> as long */
  clock_t c0, c1; /* clock_t is defined on <time.h> and <sys/types.h> as int */

  long count;
  double a, b, c;

  printf ("using UNIX function time to measure wallclock time ... \n");
  printf ("using UNIX function clock to measure CPU time ... \n");

  t0 = time(NULL);
  c0 = clock();

  printf ("\tbegin (wall):            %ld\n", (long) t0);
  printf ("\tbegin (CPU):             %d\n", (int) c0);

  printf ("\t\tsleep for 5 seconds ... \n");
  sleep(5);

  printf ("\t\tperform some computation ... \n");
  for (count = 1l; count < 10000000l; count++) {
     a = sqrt(count);
     b = 1.0/a;
     c = b - a;
  }

  t1 = time(NULL);
  c1 = clock();

  printf ("\tend (wall):              %ld\n", (long) t1);
  printf ("\tend (CPU);               %d\n", (int) c1);
  printf ("\telapsed wall clock time: %ld\n", (long) (t1 - t0));
  printf ("\telapsed CPU time:        %f\n", (float) (c1 - c0)/CLOCKS_PER_SEC);
  return 0; /* report success to the calling shell */
}

Compile this program with

gustav@s1n01:~/src/try 232 $ cc -o c-time c-time.c -lm
gustav@s1n01:~/src/try 233 $ 

and run it as follows:

gustav@s1n01:~/src/try 233 $ time ./c-time
using UNIX function time to measure wallclock time ... 
using UNIX function clock to measure CPU time ... 
        begin (wall):            850891598
        begin (CPU):             0
                sleep for 5 seconds ... 
                perform some computation ... 
        end (wall):              850891641
        end (CPU);               34120000
        elapsed wall clock time: 43
        elapsed CPU time:        34.120000

Real   42.31
User   34.12
System 0.01
gustav@s1n01:~/src/try 234 $ 

Observe that times returned by this program agree with times returned by the UNIX command time, which, although it should not be surprising, is always a source of great joy and bewilderment. Well, it's a Yuletide season after all...

While speaking of Christmas, and it is Christmas Eve right now, while I'm working on this section, observe that the word ``Yuletide'' derives from the Old English word ``geola'', which was the name of a Germanic pagan feast lasting 12 days. That holiday was not related to the Roman Saturnalia, which lasted 7 days.

Timing a Job in Fortran

This section will be somewhat longer and more complicated than the previous section about timing the job in C. The reason for that is that Fortran programmers are often less acquainted with the new Fortran 90 features that let them time their programs portably. On the other hand, even in Fortran 90 programs cannot check CPU time usage without calling vendor and system specific functions.

Fortran 77 was a very primitive language and there was no portable way to check either a wall-clock or a CPU time from within F77 programs at all. Luckily, those days are over. Nobody in his or her right mind would write a new program in F77 nowadays, because F90 is a lot more expressive and it is now becoming increasingly available on most computer platforms. Furthermore F90's close association with High Performance Fortran offers painless, efficient, and portable parallelisation of your applications.

Fortran 90 defines two intrinsic procedures date_and_time and system_clock, which return elapsed wall-clock time in various formats.

The date_and_time procedure takes 4 arguments, all of which are optional:

  1. date: intent(out), a character string at least 8 characters long
  2. time: intent(out), a character string at least 10 characters long
  3. zone: intent(out), a character string at least 5 characters long
  4. values: intent(out), an array of integers at least 8 entries long

For our purposes we don't need date or time returned as strings. We only need the numbers, which are returned in values, so we'll call this procedure using a keyword argument list:

   call date_and_time (values=time_array)

where time_array is our array of integers. The returned values will have the following ordering:

  1. time_array(1): year
  2. time_array(2): month of the year
  3. time_array(3): day of the month
  4. time_array(4): time offset with respect to UTC in minutes
  5. time_array(5): hour of the day
  6. time_array(6): minutes of the hour
  7. time_array(7): seconds of the minute
  8. time_array(8): milliseconds of the second

Subroutine system_clock is somewhat easier to use. It takes 3 optional arguments:

  1. count: intent(out), an integer
  2. count_rate: intent(out), an integer
  3. count_max: intent(out), an integer

This subroutine is somewhat similar to C-function clock, in the sense that it counts time at a rate of count_rate counts per second up to count_max, and then resets itself to zero and resumes the counting. But unlike clock it measures wall-clock time, not the CPU time. As you will see from the following example, under AIX procedure system_clock resets every day at midnight. But this particular behaviour is not specified in the F90 standard.

In fact there is no intrinsic Fortran-90 procedure for measuring CPU time. For that we have to use XL-Fortran service and utility function etime_. At this stage the program ceases to be portable, so it is a good idea to isolate the parts of the code that rely on etime_ with cpp #ifdef .. #endif brackets. In the example below I use gcc -E -P -C instead of cpp. It is important to remove cpp generated line references before passing the file to Fortran compiler. Option -P ensures that. The importance of option -C will become clearer in our next Fortran-90 example.

Function etime_ is defined in the xlfutility module, which must be included with the use statement:

use xlfutility
The function takes a structure of type tb_type as argument (intent(out)) and returns the sum of system and user components of the CPU time since the start of the execution of a process. Additionally user time and system time are written on usrtime and systime slots of the argument.

For more information about service and utility procedures provided in xlfutility read the ``XL Fortran for AIX, Language Reference, Version 3, Release 2'' manual, pages 445-451. Note: this link points to a file which is outside of the WWW directory tree and it will work only if you are reading this document on the QPSF system. The manual, in compressed PostScript, can be found in the /usr/lpp/xlf/ps directory on nodes s1n01 and s1n02, file xlflr.ps.Z.

The following Fortran 90 program shows how to use all three procedures in order to time your computation.

      program f_time

#ifdef XLF
      use xlfutility

! Variables for function etime_

      real(4) elapsed_0, elapsed_1
      type (tb_type) etime_struct_0, etime_struct_1
#endif

! Variables for subroutine system_clock

      integer count_0, count_1, count_rate, count_max

! Variables for subroutine date_and_time

      integer time_array_0(8), time_array_1(8)
      real start_time, finish_time

! Variables for computation

      integer n
      parameter (n = 1000000)
      double precision a(n), b(n), c(n)

      write (6, '(1x, 1a)') 'using F90 procedure date_and_time ...'
      write (6, '(1x, 1a)') 'using F90 procedure system_clock ...'
#ifdef XLF
      write (6, '(1x, 1a)') 'using XLF function dtime_ ...'
#endif
      
! Mark the beginning of the program

      call date_and_time(values=time_array_0)
      start_time = time_array_0 (5) * 3600 + time_array_0 (6) * 60 &
           + time_array_0 (7) + 0.001 * time_array_0 (8)
      call system_clock(count_0, count_rate, count_max)
#ifdef XLF
      elapsed_0 = etime_(etime_struct_0)
#endif

      write (6, '(8x, 1a, 1f16.6)') 'begin (date_and_time):  ', &
           start_time
      write (6, '(8x, 1a, 1f16.6)') 'begin (system_clock):   ', &
           count_0 * 1.0 / count_rate
#ifdef XLF
      write (6, '(8x, 1a, 1f16.6)') 'begin (etime_%usrtime): ', &
           etime_struct_0%usrtime
      write (6, '(8x, 1a, 1f16.6)') 'begin (etime_%systime): ', &
           etime_struct_0%systime
#endif

! Sleep for 5 seconds

#ifdef XLF
      write (6, '(16x, 1a)') 'sleep for 5 seconds ... '
      call sleep_ (5)
#endif

! Perform some computation

      write (6, '(16x, 1a)') 'perform some computation ... '
      a = (/ (i, i = 1, n) /)
      a = sqrt(a)
      b = 1.0 / a
      c = b - a      

! Mark the end of the program

      call date_and_time(values=time_array_1)
      finish_time = time_array_1 (5) * 3600 + time_array_1 (6) * 60 &
           + time_array_1 (7) + 0.001 * time_array_1 (8)
      call system_clock(count_1, count_rate, count_max)
#ifdef XLF
      elapsed_1 = etime_(etime_struct_1)
#endif

      write (6, '(8x, 1a, 1f16.6)') 'end (date_and_time):    ', &
           finish_time
      write (6, '(8x, 1a, 1f16.6)') 'end (system_clock):     ', &
           count_1 * 1.0 / count_rate
#ifdef XLF
      write (6, '(8x, 1a, 1f16.6)') 'end (etime_%usrtime):   ', &
           etime_struct_1%usrtime
      write (6, '(8x, 1a, 1f16.6)') 'end (etime_%systime):   ', &
           etime_struct_1%systime
#endif

! Print elapsed time

      write (6, '(8x, 1a, 1f16.6)') 'elapsed wall clock time:', &
           finish_time - start_time           
#ifdef XLF
      write (6, '(8x, 1a, 1f16.6)') 'elapsed CPU time:       ', &
           etime_struct_1%usrtime - etime_struct_0%usrtime
#endif

      end program f_time

This file must be passed through cpp first in order to generate the plain Fortran code. Then the code must be compiled and linked with Fortran-90 compiler. The most convenient way to go about all that is to write appropriate instructions on a Makefile and use make to generate the binary. Here is the Makefile used for our F90 example code:

F90     = xlf90
CPP     = gcc -E -P -C
DEFINES = -DXLF
OPTS    = # -g

all: f_time

f_time: f_time.o
	$(F90) $(OPTS) -o f_time f_time.o

f_time.o: f_time.f
	$(F90) $(OPTS) -c f_time.f

f_time.f: f_time.cpp
	$(CPP) $(DEFINES) f_time.cpp > f_time.f

clean: 
	rm -f f_time.f f_time.o f_time

The compilation now proceeds as follows:

gustav@s1n01:~/src/try/f90 383 $ make
gcc -E -P -C -DXLF f_time.cpp > f_time.f
xlf90  -c f_time.f
** f_time   === End of Compilation 1 ===
1501-510  Compilation successful for file f_time.f.
xlf90  -o f_time f_time.o
gustav@s1n01:~/src/try/f90 384 $ 

And the program itself can be run like that:

gustav@s1n01:~/src/try/f90 384 $ time ./f_time
 using F90 procedure date_and_time ...
 using F90 procedure system_clock ...
 using XLF function dtime_ ...
        begin (date_and_time):      64470.828125
        begin (system_clock):       64470.820312
        begin (etime_%usrtime):         0.000000
        begin (etime_%systime):         0.020000
                sleep for 5 seconds ... 
                perform some computation ... 
        end (date_and_time):        64478.429688
        end (system_clock):         64478.421875
        end (etime_%usrtime):           2.510000
        end (etime_%systime):           0.020000
        elapsed wall clock time:        7.601562
        elapsed CPU time:               2.510000

Real   7.76
User   2.51
System 0.02
gustav@s1n01:~/src/try/f90 385 $ 

And, as before, we can see that our internal estimates agree pretty well with results returned by UNIX program time. There is a small discrepancy of 0.16 s in the estimate of the elapsed wall-clock time (7.60 s versus 7.76 s), the explanation of which is left to the reader as an exercise. Also observe that time returned by procedure system_clock is roughly the same as time returned by procedure date_and_time, which means that system_clock must be reset at midnight, as I have already remarked above.

Timing a Job in Lisp

A truly marvellous thing about Common Lisp is that it is almost completely independent of the semantics of the operating system it runs on. It provides its own semantics for just about everything including even manipulation of a directory tree.

Common Lisp time functions are discussed in CLtL2 in chapter 25, ``Miscellaneous Features'', section 25.4, ``Environment Inquiries'', subsection 1, pages 702-705.

The particular version of Common Lisp I will work here with is clisp-1996-08-23 developed by Bruno Haible, Michael Stoll and Marcus Daniels. It is largely conformant to CLtL2, has CLOS, and is currently evolving towards ANSI CL. All time functions discussed in CLtL2 are available.

Common Lisp has a function, which is similar to Fortran 90 procedure date_and_time. The function is get-decoded-time. It returns 9 values, which have to be captured with the multiple-value-bind macro. For example

(defun date ()
  (multiple-value-bind (second minute hour day month year day-of-week
			       daylight-saving-time-p time-zone)
		       (get-decoded-time)
		       (format t "~a, ~a-~a-~a, ~a:~a:~a~%"
			       (case day-of-week
				     (0 "Monday")
				     (1 "Tuesday")
				     (2 "Wednesday")
				     (3 "Thursday")
				     (4 "Friday")
				     (5 "Saturday")
				     (6 "Sunday"))
			       day
			       (case month
				     (1 "Jan")
				     (2 "Feb")
				     (3 "Mar")
				     (4 "Apr")
				     (5 "May")
				     (6 "Jun")
				     (7 "Jul")
				     (8 "Aug")
				     (9 "Sep")
				     (10 "Oct")
				     (11 "Nov")
				     (12 "Dec"))
			       year
			       hour minute second)))

When you load this function and evaluate (date) you'll see something like:

> (date)
Friday, 20-Dec-1996, 13:38:47
NIL
> 

The similarity between get-decoded-time and date_and_time is not the only similarity between Fortran-90 and Common Lisp. There are many more, and the reason for those similarities is that Guy L. Steele Jr. sat on both ANSI committees and had a profound effect on the development of both languages.

But Common Lisp has a more convenient function for our needs, the function is get-universal-time, which returns a single integer number: the number of seconds since midnight, January 1, 1900 GMT. The function get-decoded-time is, in fact, a wrapper around get-universal-time and decode-universal-time. To see that simply try evaluating:

(get-decoded-time)
(get-universal-time)
(decode-universal-time (get-universal-time))

In C-language there is a function called ctime, which is similar to the Common Lisp function decode-universal-time. It converts the output of function time into a 26-character string, which yields day and time in a traditional human readable format.

The CPU time can be measured by using function get-internal-run-time, which returns a number of internal time units used by the Lisp process so far. The units can be converted to seconds as follows:

(/ (get-internal-run-time) internal-time-units-per-second)

This function is very similar to C-function clock and to Fortran-90 subroutine system_clock, although the latter returns wall-clock time, not the CPU time.

The following example is a translation of our previous examples in C and in Fortran-90 to Common Lisp.

(format t "~&using function get-universal-time to measure wallclock time ...")
(format t "~&using function get-internal-run-time to measure CPU time ...")
(setq wall-clock-t0 (get-universal-time))
(setq cpu-t0 (/ (* 1.0 (get-internal-run-time)) 
		internal-time-units-per-second))
(format t "~&~1,8@Tbegin (wall):            ~a" wall-clock-t0)
(format t "~&~1,8@Tbegin (cpu):             ~a" cpu-t0)
(format t "~&~1,16@Tsleep for 5 seconds ...")
(sleep 5)
(format t "~&~1,16@Tperform some computation ...")
(do ((i 1 (+ i 1))
     (sum 0.0))
    ((= i 10000) sum)
    (setf sum (+ sum
		 (let ((a (sqrt i)))
		   (- (/ 1.0 a) a)))))
(setq wall-clock-t1 (get-universal-time))
(setq cpu-t1 (/ (* 1.0 (get-internal-run-time)) 
		internal-time-units-per-second))
(format t "~&~1,8@Tend (wall):              ~a" wall-clock-t1)
(format t "~&~1,8@Tend (cpu):               ~a" cpu-t1)
(format t "~&~1,8@Telapsed wall clock time: ~a" (- wall-clock-t1
						   wall-clock-t0))
(format t "~&~1,8@Telapsed CPU time:        ~a" (- cpu-t1 cpu-t0))

It can be run and timed independently as follows:

gustav@s1n01:~/src/try/cl 222 $ time clisp -q << EOF
gustav@s1n01:~/src/try/cl 223 > (load "l_time.lsp")
gustav@s1n01:~/src/try/cl 223 > EOF
;; Loading file l_time.lsp ...
using function get-universal-time to measure wallclock time ...
using function get-internal-run-time to measure CPU time ...
        begin (wall):            3060055909
        begin (cpu):             0.29
                sleep for 5 seconds ...
                perform some computation ...
        end (wall):              3060055917
        end (cpu):               2.78
        elapsed wall clock time: 8
        elapsed CPU time:        2.49
;; Loading of file l_time.lsp is finished.
T

Real   8.60
User   2.70
System 0.09
gustav@s1n01:~/src/try/cl 223 $ 

This time the discrepancy is somewhat larger than in our Fortran-90 and C examples, because the Common Lisp interpreter takes a long time to load. But the difference is only of the order of a second, and if you plan to time a program that will take two hours to execute, a second more or less won't matter. You should always reserve at least a few minutes for a clean-up and resubmission of your job.


Restoring and Saving the State of a Job

In this section we shall discuss how to save and then restore the state of the computation between successive invocations of a program via LoadLeveler. The basic idea is that information can be transferred between successive invocations of a program in only three ways:

  1. through a file, or
  2. through an environmental variable, or
  3. through command line switches
Transferring data through a file is perhaps the most common practice. Using files you can transfer very large amounts of data: e.g., the whole state of a 3D flow, or the whole state of a protein, or the whole state of a car in a crash simulation. Basically, files can be used to transfer any information from one instantiation of a program to another, including even small items of information, such as whether the program should restart a computation from a previously reached state, or whether it should start a new computation.

Instead of writing on files, the program, in principle, can also write on user's environment. On next invocation the program can check for existence and state of certain predefined environmental variables, and obtain required information that way. This method is good for transferring small amounts of information, e.g., the name of a checkpoint file, or the request to initialise a run, but not for very large data sets.

Of course, using environmental variables will not work if the variables themselves are not transmitted from one LoadLeveler process to another one. There is a special LoadLeveler directive:

# @ environment = COPY_ALL
which instructs LoadLeveler to copy all environmental variables from the current shell and transfer them to the shell within which the job will be executed.

But there is one problem with writing on the user's environment. This can be done portably only from within C (or C++), and even then the change is seen only by the process itself and by the processes it starts, not by the shell that launched it. The Fortran-90 standard itself provides no procedure for accessing the environment; many compilers offer getenv for reading it as an extension, but rarely anything for writing on it. Common Lisp, in turn, specifies only that such procedures should be available in the implementation dependent system (nickname: sys) package, but does not specify exactly what should be in that package. Most Lisps that I know of have sys::getenv, but, again, not all of them have sys::putenv or sys::setenv.

So, we shall have to use some other mechanism to convey information about the name of the checkpoint file and whether the job should be continued, for example, we can write it at the end of a log file. After our application exits, the LoadLeveler script responsible for the execution of the application can inspect the log, and if it finds the instruction that the job should be continued, it can transfer that information to environmental variables, and resubmit itself. When LoadLeveler again gets to activate that script, our application will begin by checking for certain variables in the environment and for their content. From there it will learn if it should continue or reinitialise the computation, and if it should continue, where it should look for information about the state reached by the previous run.

Instead of using environment we could transfer the instruction to restart the computation and the name of the checkpoint file by using command line arguments. This is a neat way of doing things, but it's somewhat harder to program than reading the environment. You can use this mechanism portably with C and C++, but not with Fortran or Lisp programs. Almost all Fortrans, that I have worked with, support reading command line arguments, but they all do it differently, and, as I said, it's not a part of Fortran standard. Reading command line arguments in Lisp is, well, quite impossible, because most Lisp programs must be executed from within Lisp environment. Some Lisp systems, e.g., Allegro, let programmers build stand-alone applications, but this is outside of X3J13 specifications, and, at this stage, it cannot be done with clisp anyway.

Restoring and Saving in C

The following listing shows a very simple C-language program which, if requested, reads the state of computation from a file. If not requested it initialises a new computation. Then some further computation is performed and the new state is again saved on a file.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>   /* declares sleep */

int main(void)
{
  char *restart_name, *restart, old_restart_name[BUFSIZ];
  FILE *restart_file;
  int n;

  /* Is this a continued job or a new one? */

  if (! (restart = getenv ("RSAVE_RESTART"))) {
    printf ("Starting a new run.\n");
    n = 0;
  }
  else {
    if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
      fprintf (stderr, "error: no checkpoint file for the restart job\n");
      exit (1);
    }
    else {
      printf ("Restarting the job from %s.\n", restart_name);
      if (! (restart_file = fopen(restart_name, "r"))) {
	perror (restart_name);
	exit (2);
      }
      else {
	if (! (fscanf (restart_file, "%d", &n) > 0)) {
	  fprintf (stderr, "%s: input file format error\n", restart_name);
	  exit (3);
	}
	else {
	  fclose (restart_file);
	}
      }          
    }
  }

  printf ("n = %d\n", n);
  printf ("\tcomputing ... "); fflush (stdout);
  sleep (5);
  n++;
  printf ("done.\n");
  printf ("n = %d\n", n);

  if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
    printf ("checkpointing not requested, exiting...\n");
    exit (0);
  }
  else {
    if (restart) {
      strcpy (old_restart_name, restart_name);
      strcat (old_restart_name, ".old");
      printf ("renaming old restart file to %s\n", old_restart_name);
      if (0 > rename (restart_name, old_restart_name)) {
	perror (old_restart_name);
	exit (4);
      }
    }
    printf ("saving data on %s\n", restart_name);
    if (! (restart_file = fopen (restart_name, "w"))) {
      perror (restart_name);
      exit (5);
    }
    else {
      fprintf (restart_file, "%d\n", n);
      fclose (restart_file);
    }
  }
  exit (0);
}

I'll explain how this program works in detail below, but first let's just see what it does:

gustav@s1n01:~/src/resubmit/c 395 $ env | grep RSAVE
RSAVE_CHECKFILE=rsave.dat
RSAVE_RESTART=yes
gustav@s1n01:~/src/resubmit/c 396 $ unset RSAVE_RESTART
gustav@s1n01:~/src/resubmit/c 397 $ env | grep RSAVE
RSAVE_CHECKFILE=rsave.dat
gustav@s1n01:~/src/resubmit/c 398 $ ./rsave
Starting a new run.
n = 0
        computing ... done.
n = 1
saving data on rsave.dat
gustav@s1n01:~/src/resubmit/c 399 $ export RSAVE_RESTART="yes"
gustav@s1n01:~/src/resubmit/c 400 $ ./rsave
Restarting the job from rsave.dat.
n = 1
        computing ... done.
n = 2
renaming old restart file to rsave.dat.old
saving data on rsave.dat
gustav@s1n01:~/src/resubmit/c 401 $ cat rsave.dat
2
gustav@s1n01:~/src/resubmit/c 402 $ ./rsave
Restarting the job from rsave.dat.
n = 2
        computing ... done.
n = 3
renaming old restart file to rsave.dat.old
saving data on rsave.dat
gustav@s1n01:~/src/resubmit/c 403 $ ./rsave
Restarting the job from rsave.dat.
n = 3
        computing ... done.
n = 4
renaming old restart file to rsave.dat.old
saving data on rsave.dat
gustav@s1n01:~/src/resubmit/c 404 $ 

The Synopsis of the Program

Here is the promised explanation of the program in detail.

The first thing that the program does is to check for the existence of the environmental variable RSAVE_RESTART. If the variable does not exist, the program starts a new run and initialises n to 0.

If the variable RSAVE_RESTART exists (it doesn't really matter what its value is) then we first check if another variable, which should specify the name of the checkpoint file, RSAVE_CHECKFILE, exists too. If it doesn't, then we have no way to find the name of the checkpoint file. So in that case we print an error message, flag an error on exit (value 1) and exit.

If the variable RSAVE_CHECKFILE exists then we use its value as the name of the checkpoint file, print a message about restarting the job from that file, and attempt to open it for reading.

If for some reason the file cannot be opened, we print the diagnostic on standard error (with perror), flag an error (value 2) and exit.

If the file has been opened without problems we try to read an integer number from it. That integer is the whole object of our simple computation in this program and it represents the state of the system.

It may happen that for some reason the checkpoint file does not contain that integer. In that case we print the corresponding error message, flag an error (value 3) and exit.

But if everything goes well, by this time we should have our state of the system in hand, so we close the checkpoint file (in case of an error exit the file would be closed automatically) and commence the computation.

The computation is quite trivial. We simply increment the integer read from the file by 1. In order to add a little more body to the program we also sleep for 5 seconds (this is called putting on weight). We will need that sleep in our next example, which will combine timing with saving and restoring.

Once the computation is finished we again check the environmental variable RSAVE_CHECKFILE. Observe that this variable has not been looked up so far by the branch of the program that does the initialisation. That is why we do it here again, even though the other branch, which is responsible for the restarting of the job, would have looked it up already.

If the variable RSAVE_CHECKFILE is not defined, we write the message that ``checkpointing has not been requested'' and exit. No error condition is flagged this time.

If the variable RSAVE_CHECKFILE exists, and if the job is a restarted one, then we attempt to rename the original restart file by appending the suffix ".old" to its name.

If for some reason that cannot be done, we print a diagnostic on standard error using perror, flag an error (value 4) and exit.

Otherwise, having renamed the old restart file, we attempt to open, this time for writing, a new file bearing the old name. If for some reason that cannot be done a diagnostic is printed on standard error with perror, an error exit is flagged (value 5) and the program aborts.

Otherwise, i.e., if all went well and we have the new restart file opened, we write the new value of n on it, close it, and exit with status 0.

This is really quite simple stuff. Whatever complexity there is in the presented example, it derives from my attempt to make the program robust. Regardless of whether variables RSAVE_RESTART and RSAVE_CHECKFILE exist, regardless of whether the data file itself exists, the program should always do something more or less sensible, write meaningful error messages if need be, and exit gracefully conveying a meaningful exit value to the shell. For seasoned C and C++ programmers all that is just bread and butter.

Restoring and Saving in Fortran

Our Fortran-90 example does much the same as our C example, so if you have skipped the previous section, you should go back to ``The Synopsis of the Program'' and read it now. Again, I have attempted to make the program relatively robust, which adds to its complexity a little.

In Fortran-90, operations on files are performed by statements such as open, read, and write, not by functions that return a status value. For this reason, the Fortran version of our example does not flow as smoothly as our C program. Here any I/O problems are addressed by jumping to a specified label. It is customary to place all error handlers together at the end of the file.

When the cpp preprocessor is invoked on this file, we must use the -C option, i.e., we must preserve C (and C++) language comments. The reason is that the Fortran string-append operator, //, is the same as the C++ comment marker. Without the -C option, all string-append operations would be stripped from the produced Fortran code.

There is no way to rename a file within standard Fortran-90. So, in order to save the checkpoint file under a new name I have to use the subroutine system, which is a compiler extension rather than a standard intrinsic. Unfortunately, the way this subroutine is implemented, no exit status is returned to the calling Fortran program, so we have no means of checking whether the requested operation was successful.

For this reason, when the checkpoint file is opened for writing, I use the 'replace' status. If the renaming operation is unsuccessful, the old checkpoint file will be replaced with the new one.

Although the package xlfutility provides subroutine exit_, there is no need to call it here. If the stop statement is followed by a number, XL Fortran makes that number available to the parent shell as the exit status of the program.

Observe that Fortran-90 makes the life of a Fortran programmer a lot easier. One of the most useful new Fortran-90 facilities is the function len_trim, which returns the length of a string with trailing blanks stripped. In the open statement you'll find a new specifier, action, which states the kind of operation that will be attempted on the file, e.g., 'read' or 'write'.

Now, here is the Fortran-90 example itself:

program rsave

#ifdef XLF
  use xlfutility
#endif
  character (len=64) restart, restart_name, old_restart_name
  character (len=512) command
  integer n, restart_file, status
  parameter (restart_file = 21)

! Is this a continued job or a new one?

  call getenv ('RSAVE_RESTART', restart)
  if (len_trim(restart) .eq. 0) then
     write (6, '(1x, 1a)') 'Starting a new run'
     n = 0
  else
     call getenv ('RSAVE_CHECKFILE', restart_name)
     if (len_trim(restart_name) .eq. 0) then
        write (6, '(1x, 1a)') 'Error: no checkpoint file for the restart job'
        stop 1
     else
        write (6, '(1x, 2a)') 'Restarting the job from ', restart_name
        open (unit=restart_file, iostat=status, err=100, file=restart_name, &
             status='old', action='read')
        read (restart_file, '(1i7)', iostat=status, err=110, end=110) n
        close (restart_file)
     end if
  end if

! This is our computation part

  write (6, '(1x, 1a, 1i7)') 'n = ', n
  write (6, '(9x, 1a, $)') 'computing ... '
#ifdef XLF
  call flush_ (6)
#endif
  
#ifdef XLF
  call sleep_ (5)
#endif
  n = n + 1
  write (6, '(1a)') 'done.'
  write (6, '(1x, 1a, 1i7)') 'n = ', n

! And now we save the result on a new checkpoint file, saving
! the old one under a new name if need be.

  call getenv ('RSAVE_CHECKFILE', restart_name)
  if (len_trim(restart_name) .eq. 0) then
     write (6, '(1x, 1a)') 'Checkpointing not requested, exiting ... '
     stop 0
  else
     if (.not. (len_trim(restart) .eq. 0)) then
        old_restart_name = restart_name (1:len_trim(restart_name)) // '.old'
        write (6, '(1x, 2a)') 'Renaming the old restart file to ', &
             old_restart_name
        command = 'mv' // ' ' // restart_name // ' ' // old_restart_name
        call system (command)
     end if
     write (6, '(1x, 2a)') 'Saving data on ', restart_name
     open (unit=restart_file, iostat=status, err=120, file=restart_name, &
             status='replace', action='write')
     write (restart_file, '(1i7)') n
     close (restart_file)
  end if
  stop 0

! error handlers

! error while opening the checkpoint file for reading

100 write (6, '(1x, 3a)') 'Error: while opening ', restart_name, ' for reading'
  write (6, '(8x, 1a, 1i7)') 'iostat = ', status
  stop 2

! error while trying to read input file

110 write (6, '(1x, 2a)') 'Error: while reading from ', restart_name
  write (6, '(8x, 1a, 1i7)') 'iostat = ', status
  stop 3

! error while opening the checkpoint file for writing

120 write (6, '(1x, 3a)') 'Error: while opening ', restart_name, ' for writing'
  write (6, '(8x, 1a, 1i7)') 'iostat = ', status
  stop 5

end program rsave

Compile this program as follows:

gustav@s1n01:~/src/resubmit/f90 333 $ make
gcc -E -P -C -DXLF rsave.cpp > rsave.f
xlf90  -c rsave.f
** rsave   === End of Compilation 1 ===
1501-510  Compilation successful for file rsave.f.
xlf90  -o rsave rsave.o
gustav@s1n01:~/src/resubmit/f90 334 $ 

And run it like this:

gustav@s1n01:~/src/resubmit/f90 335 $ env | grep RSAVE
RSAVE_CHECKFILE=rsave.dat
RSAVE_RESTART=yes
gustav@s1n01:~/src/resubmit/f90 336 $ unset RSAVE_RESTART
gustav@s1n01:~/src/resubmit/f90 337 $ ./rsave
 Starting a new run
 n =       0
         computing ... done.
 n =       1
 Saving data on rsave.dat                                                       
STOP 0
gustav@s1n01:~/src/resubmit/f90 338 $ export RSAVE_RESTART=yes
gustav@s1n01:~/src/resubmit/f90 339 $ ./rsave
 Restarting the job from rsave.dat                                              
 n =       1
         computing ... done.
 n =       2
 Renaming the old restart file to rsave.dat.old                                 
 Saving data on rsave.dat                                                       
STOP 0
gustav@s1n01:~/src/resubmit/f90 340 $ ./rsave
 Restarting the job from rsave.dat                                              
 n =       2
         computing ... done.
 n =       3
 Renaming the old restart file to rsave.dat.old                                 
 Saving data on rsave.dat                                                       
STOP 0
gustav@s1n01:~/src/resubmit/f90 341 $ 

Restoring and Saving in Lisp

Now the same thing in Common Lisp.

Unlike Fortran-90, Common Lisp is well endowed with file manipulation utilities. Furthermore, as in C, every Common Lisp procedure returns a value, which should not be surprising, because in Lisp every form is an expression that evaluates to some value.

Consequently, our Common Lisp program flows quite nicely, and no jumps to error handlers are needed.

The logic of the program is much the same as the logic of our C example. So, again, if you've skipped the C example, go back to ``The Synopsis of the Program'' now, and read it (you can skip the Fortran example though). However, there are a few subtle Lispish differences here and there.

First, because Lisp lives and works in its own environment, we cannot ``exit'' to shell with a specific error status. (Actually, we should be able to, but clisp's function sys::exit returns only 0 or 1 regardless of the value of its argument.) So, instead our function rsave returns various integers if a problem occurs, and nil if there are no problems. The return is accomplished by evaluating the form (return-from rsave ...), which is discussed in Section 7.7, ``Blocks and Exits'' of CLtL2.

The other characteristic Lisp feature is the use of the form with-open-file, which automatically opens and closes a file. Observe that the first with-open-file clause returns the value of n, if the file has been successfully opened. But at this stage n is not guaranteed to be a number. It can be anything including nil - the latter if the file is empty. So we must test n for being a number, and we return from function rsave with ``exit status'' 3, if it is not.

The beauty of Lisp is that the way this procedure is written, it all appears upside down: the test of n is the first thing you see in the listing of the numberp clause, and the value of n is read from the checkpoint file at the end of the clause!

The function rename-file doesn't return anything interesting in case a problem occurs. Instead it signals an error, which halts Lisp, and the corresponding error message is automatically printed by the Lisp interpreter.

(defun rsave ()
  ;;
  ;; Either initialise n to 0 or read it from a file
  ;;
  (if (not (setq restart (sys::getenv "RSAVE_RESTART")))
      (progn
	(format t "~&Starting a new run")
	(setq n 0))
    (if (not (setq restart-name (sys::getenv "RSAVE_CHECKFILE")))
	(progn
	  (format t "~&Error: no checkpoint file for the restart job")
	  (return-from rsave 1))
      (progn
	(format t "~&Restarting the job from ~a" restart-name)
	(when (not (numberp
		    (setq n (with-open-file 
			     (restart-file restart-name
					   :direction :input 
					   :element-type 'string-char
					   :if-does-not-exist nil)
			     (if (not (streamp restart-file))
				 (progn
				   (format t "~&Error: cannot open ~a for reading"
					   restart-name)
				   (return-from rsave 2))
			       (read restart-file nil nil))))))
	      (progn
		(format t "~&Error: bad input file format: ~a" restart-name)
		(return-from rsave 3))))))
  ;;
  ;; Now we begin our computation
  ;;
  (format t "~&n = ~a" n)
  (format t "~&~1,8@Tcomputing ... ")
  (sleep 5)
  (setf n (1+ n))
  (format t "done.")
  (format t "~&n = ~a" n)
  ;;
  ;; Save the result on a new restart file, if requested
  ;;
  (if (not (setq restart-name (sys::getenv "RSAVE_CHECKFILE")))
      (progn
	(format t "~&checkpointing not requested, exiting ... ")
	(return-from rsave nil))
    (progn
      (when restart
	    (let ((old-restart-name (concatenate 'string restart-name ".old")))
	      (format t "~&renaming old restart file to ~a" old-restart-name)
	      (rename-file restart-name old-restart-name)))
      (format t "~&Saving data on ~a" restart-name)
      (with-open-file (restart-file restart-name
				    :direction :output
				    :element-type 'string-char)
		      (if (not (streamp restart-file))
			  (progn
			    (format t "~&Error: cannot open ~a for writing"
				    restart-name)
			    (return-from rsave 5))
			(format restart-file "~a~%" n))))))

I have compiled this function with compile-file for faster loading and execution. The resulting file is called rsave.fas, and here is how this Lisp program can be run on our system:

gustav@s1n01:~/src/resubmit/lisp 466 $ env | grep RSAVE
RSAVE_CHECKFILE=rsave.dat
RSAVE_RESTART=yes
gustav@s1n01:~/src/resubmit/lisp 467 $ unset RSAVE_RESTART
gustav@s1n01:~/src/resubmit/lisp 468 $ lisp -q -i rsave.fas << EOF
gustav@s1n01:~/src/resubmit/lisp 469 > (rsave)
gustav@s1n01:~/src/resubmit/lisp 469 > EOF
;; Loading file rsave.fas ...
;; Loading of file rsave.fas is finished.
Starting a new run
n = 0
        computing ... done.
n = 1
Saving data on rsave.dat
NIL
gustav@s1n01:~/src/resubmit/lisp 469 $ export RSAVE_RESTART="yes"
gustav@s1n01:~/src/resubmit/lisp 470 $ lisp -q -i rsave.fas << EOF
gustav@s1n01:~/src/resubmit/lisp 471 > (rsave)
gustav@s1n01:~/src/resubmit/lisp 471 > EOF
;; Loading file rsave.fas ...
;; Loading of file rsave.fas is finished.
Restarting the job from rsave.dat
n = 1
        computing ... done.
n = 2
renaming old restart file to rsave.dat.old
Saving data on rsave.dat
NIL
gustav@s1n01:~/src/resubmit/lisp 471 $ lisp -q -i rsave.fas << EOF
gustav@s1n01:~/src/resubmit/lisp 472 > (rsave)
gustav@s1n01:~/src/resubmit/lisp 472 > EOF
;; Loading file rsave.fas ...
;; Loading of file rsave.fas is finished.
Restarting the job from rsave.dat
n = 2
        computing ... done.
n = 3
renaming old restart file to rsave.dat.old
Saving data on rsave.dat
NIL
gustav@s1n01:~/src/resubmit/lisp 472 $ 


Restoring, Timing, and Saving a Job: the Complete Application

In this section we shall combine job timing with job restoring and saving, and produce a complete application, which, in the next section, will be combined with a LoadLeveler script, so as to produce an automatically resubmitting job. As in the previous two sections we shall present example codes in C, Fortran-90 and in Common Lisp.

The program is a slight modification of our restore and save example. There are no really new elements here, which would require a broader explanation.

The additional logic that is laid out on top of the restore and save example is as follows.

We begin by checking for a new environmental variable, RSAVE_TIME_LIMIT. If that variable does not exist then we assume that time allowed for this job is unlimited and things work more or less as before. If the variable exists, then we attempt to read its value assuming that it is going to be a number. If it is not a number we print an error message and exit. If it is a number, then the number is assigned to variable time_limit and assumed to represent the number of wall-clock seconds allocated to this job.

On our system those queues that are timed at all are timed in terms of wall-clock seconds, not CPU seconds. After all, what really matters to other users is how long they have to wait until your job gets out of the way. For this reason we use wall-clock timers, i.e., the function time in C, the subroutine system_clock in Fortran 90, and the function get-universal-time in Common Lisp.

Once the information about time limit is obtained we proceed exactly as before, until we get to the part of the program which does the computation. Instead of just incrementing number n and sleeping for 5 seconds, we enter a loop.

If no timing has been requested the loop keeps incrementing n and sleeping, until n becomes greater than LAST_N. The latter is an arbitrary constant, which in our toy example represents something like a convergence criterion. Once the convergence has been reached, the finished flag is set to TRUE and the loop exits.

Things are more interesting if timing of the job has been requested (by setting the environmental variable RSAVE_TIME_LIMIT to some number of seconds). In that case we measure time taken by one iteration of the loop, and we check how much time there is still left after the iteration has finished. If there is still enough time to perform another iteration we continue, if not, the loop exits.

Because saving the data, cleaning up, and executing the LoadLeveler script may take additional time, we have to include a SAFETY_MARGIN when calculating the time that still remains. In this case we set SAFETY_MARGIN to 10 seconds, but if you have to save a very large data set, you should probably reserve a couple of minutes.

Flagging the Resubmission

Before exiting, we check if the whole job is finished, which it will be once the convergence criterion is satisfied. If the job is finished we write FINISHED on standard output. Otherwise we write CONTINUE.

If the standard output has been logged on a file, after the program exits, the LoadLeveler script can inspect the log, and resubmit itself, if it finds the word CONTINUE in the log. How that works will be shown in the last section of this article.

The Complete Application in C

Here is the C version of the program. The wall-clock time is measured using the UNIX function time. The parameters LAST_N and SAFETY_MARGIN have been implemented as cpp constants. I could also read them from the environment, a command line, or from an input file, but that would clutter the example.

The program always executes the statements of the do ... while loop at least once, because the exit condition is tested at the end of the loop.

Observe that the variable quit_time is initialised to 1l. That way, if timing is not requested, it remains always positive and the while test fires up only when the job is finished. Furthermore the variable timing is initialised to TRUE, and becomes FALSE only if there is no environmental variable RSAVE_TIME_LIMIT. The job is assumed to be unfinished on entry (the variable finished is initialised to FALSE) and becomes finished only when n becomes greater than LAST_N. This means that once n becomes greater than LAST_N, you can still submit the job and it will always increment n by 1 before exiting.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>   /* declares sleep */

#ifndef TRUE
# define TRUE 1
#endif
#ifndef FALSE
# define FALSE 0
#endif

#ifndef LAST_N
# define LAST_N 30
#endif

#ifndef SAFETY_MARGIN
# define SAFETY_MARGIN 10
#endif

int main (void)
{
  char *restart_name, *restart, old_restart_name[BUFSIZ];
  FILE *restart_file;
  int n, finished = FALSE, timing = TRUE;
  time_t t0, t1, t2, loop_time, time_left, time_limit, quit_time = 1l;
  char *time_limit_string;

  /* Check the clock at the beginning of the run */
  t0 = time(NULL);

  /* Check how much time we have for this job */
  if (! (time_limit_string = getenv ("RSAVE_TIME_LIMIT"))) {
    printf ("Unlimited time for this job.\n");
    timing = FALSE;
  }
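  else {
    long limit_seconds;  /* time_t need not be an int, so parse into a long */
    if (! (0 < sscanf (time_limit_string, "%ld", &limit_seconds))) {
      fprintf (stderr, "Error: bad format of RSAVE_TIME_LIMIT\n");
      exit (1);
    }
    else {
      time_limit = (time_t) limit_seconds;
      printf ("Time for this job limited to %ld seconds.\n", limit_seconds);
    }
  }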
    
  /* Is this a continued job or a new one? */

  if (! (restart = getenv ("RSAVE_RESTART"))) {
    printf ("Starting a new run.\n");
    n = 0;
  }
  else {
    if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
      fprintf (stderr, "error: no checkpoint file for the restart job\n");
      exit (1);
    }
    else {
      printf ("Restarting the job from %s.\n", restart_name);
      if (! (restart_file = fopen(restart_name, "r"))) {
	perror (restart_name);
	exit (2);
      }
      else {
	if (! (fscanf (restart_file, "%d", &n) > 0)) {
	  fprintf (stderr, "%s: input file format error\n", restart_name);
	  exit (3);
	}
	else {
	  fclose (restart_file);
	}
      }          
    }
  }

  printf ("n = %d\n", n);
  printf ("\tcomputing ... \n"); fflush (stdout);

  /* Loop while keeping an eye on the clock */

  do {
    if (timing) t1 = time(NULL);

    sleep (5);
    n++;

    /* Check if the whole simulation has been finished: 
       this is our ``convergence'' criterion. 
       */
    if (n > LAST_N) finished = TRUE;

    /* Check if we still have enough time for the next loop.
       */
    if (timing) {
       t2 = time(NULL);
       loop_time = t2 - t1;
       time_left = time_limit - (t2 - t0);
       quit_time = time_left - loop_time - SAFETY_MARGIN;
       printf ("\t\tn = %d, time left = %ld seconds\n", n, (long) time_left);
       if ((quit_time <= 0) && (! finished))
         printf ("\t\tRun out of time, exiting ... \n");    
    }
  } while ((quit_time > 0) && (! finished));

  printf ("\tdone.\n");
  printf ("n = %d\n", n);

  if (! (restart_name = getenv ("RSAVE_CHECKFILE"))) {
    printf ("checkpointing not requested, exiting...\n");
    exit (0);
  }
  else {
    if (restart) {
      strcpy (old_restart_name, restart_name);
      strcat (old_restart_name, ".old");
      printf ("renaming old restart file to %s\n", old_restart_name);
      if (0 > rename (restart_name, old_restart_name)) {
	perror (old_restart_name);
	exit (4);
      }
    }
    printf ("saving data on %s\n", restart_name);
    if (! (restart_file = fopen (restart_name, "w"))) {
      perror (restart_name);
      exit (5);
    }
    else {
      fprintf (restart_file, "%d\n", n);
      fclose (restart_file);
    }
    if (! finished)
      printf ("CONTINUE\n");
    else
      printf ("FINISHED\n");
  }
  exit (0);
}

Here is how this job is run. First I submit it with the environmental variable RSAVE_RESTART unset, which initialises the job. Then I set RSAVE_RESTART to yes and resubmit the job, which restarts from where it left off.

The job is allowed to run no longer than 30 seconds at a time. Given the safety margin of 10 seconds and a single iteration time of 5 seconds this should let our program do 4 iterations. But the while clause tests for quit_time > 0 not for quit_time >= 0, so, in effect we end up with 3 iterations instead of 4.

While the computational task remains unfinished, program rts writes CONTINUE on standard output before it exits. But the last run, when n becomes 31, is flagged with the word FINISHED.

gustav@s1n01:~/src/resubmit/c 446 $ env | grep RSAVE
RSAVE_TIME_LIMIT=30
RSAVE_CHECKFILE=rts.dat
RSAVE_RESTART=yes
gustav@s1n01:~/src/resubmit/c 447 $ unset RSAVE_RESTART
gustav@s1n01:~/src/resubmit/c 448 $ ./rts
Time for this job limited to 30 seconds.
Starting a new run.
n = 0
        computing ... 
                n = 1, time left = 25 seconds
                n = 2, time left = 20 seconds
                n = 3, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 3
saving data on rts.dat
CONTINUE
gustav@s1n01:~/src/resubmit/c 449 $ export RSAVE_RESTART=yes
gustav@s1n01:~/src/resubmit/c 450 $ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 3
        computing ... 
                n = 4, time left = 25 seconds
                n = 5, time left = 20 seconds
                n = 6, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 6
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
gustav@s1n01:~/src/resubmit/c 451 $ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 6
        computing ... 
                n = 7, time left = 25 seconds
                n = 8, time left = 20 seconds
                n = 9, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 9
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
gustav@s1n01:~/src/resubmit/c 452 $ 

 ...

gustav@s1n01:~/src/resubmit/c 458 $ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 27
        computing ... 
                n = 28, time left = 25 seconds
                n = 29, time left = 20 seconds
                n = 30, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 30
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE
gustav@s1n01:~/src/resubmit/c 459 $ ./rts
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 30
        computing ... 
                n = 31, time left = 25 seconds
        done.
n = 31
renaming old restart file to rts.dat.old
saving data on rts.dat
FINISHED
gustav@s1n01:~/src/resubmit/c 460 $ 

The Complete Application in Fortran 90

Below is the same code written in Fortran-90.

The wall-clock time is measured using the intrinsic subroutine system_clock. As I have already remarked, the number of ticks returned by this subroutine is reset to 0 every midnight. In order to avoid a catastrophe at midnight, we save the clock value returned by the first call to system_clock in count0. On all subsequent calls we always check if the returned value is less than count0, which it will be if the clock has been reset in the meantime. If we observe such an event, we add count_max to count and use the result in our computations. Assuming that your job will not block the queue for more than 24 hours, that should work just fine; otherwise additional day counters would have to be included in the logic of the program.

This is basically the only difference between our C and our Fortran-90 versions of the program.

I have made a more extensive use of the cpp preprocessor in this code. All major constants have been defined using the #ifndef .. #endif clauses at the beginning of the listing. This way their values can be altered from the command line, using the -D switch, while generating the Fortran-90 code with gcc -E -P -C.

The logic of the do loop differs slightly from the logic of the do loop in C, because the while condition is tested at the beginning of the loop. However, the default values of quit_time and finished are such that the loop will always be executed at least once. So, in effect, things should work here exactly as in our C example. The initialisation of quit_time to 1 also ensures that if timing has not been requested by the user, the job will continue running until finished. The default values of timing and finished have the same effect as in the C example.

#ifndef STDOUT
# define STDOUT 6
#endif

#ifndef RESTART_FILE
# define RESTART_FILE 21
#endif

#ifndef LAST_N
# define LAST_N 30
#endif

#ifndef SAFETY_MARGIN
# define SAFETY_MARGIN 10
#endif

#ifndef SHORT_STRING_LEN
# define SHORT_STRING_LEN 64
#endif

#ifndef LONG_STRING_LEN
# define LONG_STRING_LEN 512
#endif

program rts

  ! R)estore T)ime S)ave

#ifdef XLF
  use xlfutility
#endif
  character (len=SHORT_STRING_LEN) restart, restart_name, old_restart_name
  character (len=LONG_STRING_LEN) command
  integer n, restart_file, status
  parameter (restart_file = RESTART_FILE)

  ! Variables for timing

  integer t0, t1, t2, loop_time, time_left, time_limit, quit_time, &
       count0, count, count_rate, count_max, safety_margin
  character (len=SHORT_STRING_LEN) time_limit_string
  logical timing
  data quit_time /1/, timing /.true./, safety_margin /SAFETY_MARGIN/

  ! Variables for finishing the task

  integer last_n
  logical finished
  data finished /.false./, last_n /LAST_N/

  ! Look up the clock at the beginning of the run

  call system_clock (count0, count_rate, count_max)
  t0 = count0 / count_rate

  ! Check how much time we have for this job

  call getenv ('RSAVE_TIME_LIMIT', time_limit_string)
  if (len_trim (time_limit_string) .eq. 0) then
     write (STDOUT, '(1x, 1a)') 'Unlimited time for this job.'
     timing = .false.
  else
     read (time_limit_string, '(1i7)', iostat=status, err=130, end=130) &
          time_limit
     write (STDOUT, '(1x, 1a, 1i7, 1a)') 'Time for this job limited to ', &
          time_limit, ' seconds'
  end if

  ! Is this a continued job or a new one?

  call getenv ('RSAVE_RESTART', restart)
  if (len_trim(restart) .eq. 0) then
     write (STDOUT, '(1x, 1a)') 'Starting a new run'
     n = 0
  else
     call getenv ('RSAVE_CHECKFILE', restart_name)
     if (len_trim(restart_name) .eq. 0) then
        write (STDOUT, '(1x, 1a)') &
             'Error: no checkpoint file for the restart job'
        stop 1
     else
        write (STDOUT, '(1x, 2a)') 'Restarting the job from ', restart_name
        open (unit=restart_file, iostat=status, err=100, file=restart_name, &
             status='old', action='read')
        read (restart_file, '(1i7)', iostat=status, err=110, end=110) n
        close (restart_file)
     end if
  end if

  ! This is our computation part

  write (STDOUT, '(1x, 1a, 1i7)') 'n = ', n
  write (STDOUT, '(9x, 1a)') 'computing ... '
#ifdef XLF
  call flush_ (STDOUT)
#endif

  do while ((quit_time .gt. 0) .and. (.not. finished))
     if (timing) then
        call system_clock(count=count)
        if (count .lt. count0) count = count + count_max
        t1 = count / count_rate
     end if

#ifdef XLF
     call sleep_ (5)
#endif
     n = n + 1

     ! Check if the whole simulation has been finished:
     ! this is our ``convergence'' criterion

     if (n > last_n) finished = .true.

     ! Check if we still have enough time for the next loop

     if (timing) then
        call system_clock(count=count)
        if (count .lt. count0) count = count + count_max
        t2 = count / count_rate
        loop_time = t2 - t1
        time_left = time_limit - (t2 - t0)
        quit_time = time_left - loop_time - safety_margin
        write (STDOUT, '(16x, 1a, 1i7, 1a, 1i7, 1a)') &
             'n = ', n, ' time left = ', time_left, ' seconds'
        if ((quit_time .le. 0) .and. (.not. finished)) &
             write (STDOUT, '(16x, 1a)') 'Run out of time, exiting ... '
     end if
  end do

  write (STDOUT, '(9x, 1a)') 'done.'
  write (STDOUT, '(1x, 1a, 1i7)') 'n = ', n

  ! And now we save the result on a new checkpoint file, saving
  ! the old one under a new name if need be.

  call getenv ('RSAVE_CHECKFILE', restart_name)
  if (len_trim(restart_name) .eq. 0) then
     write (STDOUT, '(1x, 1a)') 'Checkpointing not requested, exiting ... '
     stop 0
  else
     if (.not. (len_trim(restart) .eq. 0)) then
        old_restart_name = restart_name (1:len_trim(restart_name)) // '.old'
        write (STDOUT, '(1x, 2a)') 'Renaming the old restart file to ', &
             old_restart_name
        command = 'mv' // ' ' // restart_name // ' ' // old_restart_name
        call system (command)
     end if
     write (STDOUT, '(1x, 2a)') 'Saving data on ', restart_name
     open (unit=restart_file, iostat=status, err=120, file=restart_name, &
          status='replace', action='write')
     write (restart_file, '(1i7)') n
     close (restart_file)
     if (.not. finished) then
        write (STDOUT, '(1x, 1a)') 'CONTINUE'
     else
        write (STDOUT, '(1x, 1a)') 'FINISHED'
     end if
  end if
  stop 0

  ! error handlers

  ! error while opening the checkpoint file for reading

100 write (STDOUT, '(1x, 3a)') 'Error: while opening ', restart_name, &
       ' for reading'
  write (STDOUT, '(8x, 1a, 1i7)') 'iostat = ', status
  stop 2

  ! error while trying to read input file

110 write (STDOUT, '(1x, 2a)') 'Error: while reading from ', restart_name
  write (STDOUT, '(8x, 1a, 1i7)') 'iostat = ', status
  stop 3

  ! error while opening the checkpoint file for writing

120 write (STDOUT, '(1x, 3a)') 'Error: while opening ', restart_name, &
       ' for writing'
  write (STDOUT, '(8x, 1a, 1i7)') 'iostat = ', status
  stop 5

  ! error while trying to read from time_limit_string

130 write (STDOUT, '(1x, 1a)') 'Error: bad format of RSAVE_TIME_LIMIT'
  write (STDOUT, '(8x, 1a, 1i7)') 'iostat = ', status
  stop 6

end program rts

The program can be compiled as follows:

gustav@s1n01:~/src/resubmit/f90 284 $ make
gcc -E -P -C -DXLF rts.cpp > rts.f
xlf90  -c rts.f
** rts   === End of Compilation 1 ===
1501-510  Compilation successful for file rts.f.
xlf90  -o rts rts.o
gustav@s1n01:~/src/resubmit/f90 285 $ 

And here is how I've run it. Observe another subtle difference between our C and Fortran-90 examples: when the Fortran program exits, apart from writing CONTINUE or FINISHED it also writes STOP 0. If our LoadLeveler script were to inspect only the last line of the log file for the word CONTINUE, it would miss it in this case. So, instead, the script will grep through the whole file. Of course, this assumes that a new log file will be created each time.

gustav@s1n01:~/src/resubmit/f90 288 $ env | grep RSAVE
RSAVE_TIME_LIMIT=30
RSAVE_CHECKFILE=rts.dat
gustav@s1n01:~/src/resubmit/f90 289 $ ./rts
 Time for this job limited to      30 seconds
 Starting a new run
 n =       0
         computing ... 
                n =       1 time left =      25 seconds
                n =       2 time left =      20 seconds
                n =       3 time left =      15 seconds
                Run out of time, exiting ... 
         done.
 n =       3
 Saving data on rts.dat                                                         
 CONTINUE
STOP 0
gustav@s1n01:~/src/resubmit/f90 290 $ export RSAVE_RESTART=yes
gustav@s1n01:~/src/resubmit/f90 291 $ ./rts
 Time for this job limited to      30 seconds
 Restarting the job from rts.dat                                                
 n =       3
         computing ... 
                n =       4 time left =      25 seconds
                n =       5 time left =      20 seconds
                n =       6 time left =      15 seconds
                Run out of time, exiting ... 
         done.
 n =       6
 Renaming the old restart file to rts.dat.old                                   
 Saving data on rts.dat                                                         
 CONTINUE
STOP 0
gustav@s1n01:~/src/resubmit/f90 292 $ ./rts
 Time for this job limited to      30 seconds
 Restarting the job from rts.dat                                                
 n =       6
         computing ... 
                n =       7 time left =      25 seconds
                n =       8 time left =      20 seconds
                n =       9 time left =      15 seconds
                Run out of time, exiting ... 
         done.
 n =       9
 Renaming the old restart file to rts.dat.old                                   
 Saving data on rts.dat                                                         
 CONTINUE
STOP 0
gustav@s1n01:~/src/resubmit/f90 293 $ 

   ...

gustav@s1n01:~/src/resubmit/f90 295 $ ./rts
 Time for this job limited to      30 seconds
 Restarting the job from rts.dat                                                
 n =      30
         computing ... 
                n =      31 time left =      25 seconds
         done.
 n =      31
 Renaming the old restart file to rts.dat.old                                   
 Saving data on rts.dat                                                         
 FINISHED
STOP 0
gustav@s1n01:~/src/resubmit/f90 296 $ 
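
The effect of the trailing STOP 0 line can be reproduced with a small shell sketch. The scratch file and the two helper names are made up for this illustration; the log content is abbreviated from the run above.

```shell
# Build a small log that ends the way the Fortran run above does.
# demo_log is a scratch file created only for this illustration.
demo_log=$(mktemp)
printf '%s\n' ' n =       3' ' Saving data on rts.dat' ' CONTINUE' 'STOP 0' > "$demo_log"

# Looking only at the last line misses the keyword ...
last_line_says_continue() {
    tail -n 1 "$1" | grep CONTINUE > /dev/null
}

# ... whereas searching the whole file finds it.
whole_file_says_continue() {
    grep CONTINUE "$1" > /dev/null
}

if last_line_says_continue "$demo_log"; then
    echo "last line: CONTINUE found"
else
    echo "last line: CONTINUE missed"
fi

if whole_file_says_continue "$demo_log"; then
    echo "whole file: CONTINUE found"
else
    echo "whole file: CONTINUE missed"
fi

rm -f "$demo_log"
```

With the log above, the first test reports that CONTINUE was missed and the second that it was found, which is why the whole file is searched.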

The Complete Application in Common Lisp

Whereas my two previous Lisp examples were basically C (or Fortran) programs translated to Lisp (there is a saying that a determined Fortran programmer can write Fortran programs in any language), this time I have made an effort and rewritten the program entirely in a proper Lisp style. To a reader unacquainted with Lisp this way of doing things may be quite unpalatable, because the sequence of actions, so dear to C and Fortran programmers, can be hard to extract from the listing at first glance. But Lisp programmers know where to look for specific events in the listing, and they can easily locate the key fragments of the code, if it's written in a proper Lisp style and correctly indented.

However, style apart, the program still performs exactly the same tasks as our C and Fortran examples.

In order to time the computation I use the function get-universal-time. This function is much better than the Fortran-90 subroutine system_clock because, like the UNIX function time, it counts seconds since the Epoch, and does not reset the counter every now and then. The Epoch in this case is the New Year of 1900, which is quite apt, because I happen to be writing these very words in the evening on the 31st of December 1996!
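
Should Lisp time stamps ever need to be compared with UNIX ones, the two counters differ by a fixed offset, because the UNIX Epoch is the New Year of 1970. The arithmetic is sketched below in shell; the variable and function names are mine, and the sketch assumes that the shell computes with 64-bit integers.

```shell
# Seconds between 1 January 1900 (Common Lisp universal time)
# and 1 January 1970 (UNIX time).
# The interval spans 70 years, 17 of which are leap years
# (1904, 1908, ..., 1968; 1900 itself is not a leap year).
days=$(( 70 * 365 + 17 ))
epoch_offset=$(( days * 86400 ))
echo "offset = $epoch_offset seconds"

# Convert a universal-time value to UNIX time by subtracting the offset.
universal_to_unix() {
    echo $(( $1 - epoch_offset ))
}
```

The offset comes to 2208988800 seconds. For timing a job only differences between two readings matter, so the choice of the Epoch is irrelevant as long as both readings come from the same clock.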

The basic structure of the program is very simple. It consists of a large let* form, the body of which is the do* form. Come to think of it, it should be possible to rewrite it yet again using the do* form only, because the initialisations of let* could be merged with the initialisations of do*. Well, I'll leave it to the reader as an exercise.

Checking the environmental variables and reading the input file is performed as part of the initialisation actions of the let* form.

Then the do* form initialises its own timing variables, evaluates the forms of the body and the incremental forms a few times, and finally, when the termination condition fires up, the state of the system is saved on a file while evaluating the forms of the result part of the do* loop.

Other than that there is nothing particularly new in this code, since the timing and file manipulation functions have already been discussed in previous sections. There are two subtle differences between the way this program works and the way my C and Fortran examples behaved.

The first difference is that the statement which writes the value of n and the remaining time is always executed at the beginning of the loop, because it is implemented as a part of the loop termination test, and in the do* form, as in Fortran 90, the test is performed at the beginning of each iteration.

The second difference is that there is no variable timing in the code. Instead the variable time-limit is used in two roles: first as a boolean flag, and second as a place holder for the number of seconds given to the job.

(defun rts ()
  "[r]estart, [t]ime and [s]ave: a toy Common Lisp application - invoked
without arguments."
  (let* ((t0 (get-universal-time))
	 (time-limit 
	  (let ((time-limit-string 
		 (sys::getenv "RSAVE_TIME_LIMIT")))
	    (if time-limit-string
		(let ((x (read-from-string time-limit-string)))
		  (if (numberp x) 
		      (progn
			(format t "~&Time for this job limited to ~a seconds" x)
			x)
		    (progn
		      (format t "~&Error: bad format of RSAVE_TIME_LIMIT")
		      (return-from rts 1))))
	      (format t "~&Unlimited time for this job"))))
	 (restart (sys::getenv "RSAVE_RESTART"))
	 (restart-name (sys::getenv "RSAVE_CHECKFILE"))
	 (n (if (not restart)
		(progn
		  (format t "~&Starting a new run")
		  0)
	      (if (not restart-name)
		  (progn
		    (format t "~&Error: no checkpoint file for the restart job")
		    (return-from rts 2))
		(progn
		  (format t "~&Restarting the job from ~a" restart-name)
		  (let ((x (with-open-file
			    (restart-file restart-name
					  :direction :input 
					  :element-type 'string-char
					  :if-does-not-exist nil)
			    (if (not (streamp restart-file))
				(progn
				  (format t "~&Error: cannot open ~a for reading"
					  restart-name)
				  (return-from rts 3))
			      (read restart-file nil nil)))))
		    (if (numberp x)
			x
		      (progn
			(format t "~&Error: bad input file format: ~a" restart-name)
			(return-from rts 4))))))))
	 (last-n 30)
	 (finished (> n last-n)))

    (format t "~&n = ~a" n)
    (format t "~&~1,8@Tcomputing ... ")

    (do* ((safety-margin 10)
	  (t1 (when time-limit (get-universal-time))
	      (when time-limit (get-universal-time)))
	  (t2 (when time-limit t1))
	  (loop-time (when time-limit 0))
	  (time-left (when time-limit (- time-limit (- t2 t0)))
		     (when time-limit (- time-limit (- t2 t0))))
	  (quit-time (when time-limit (- time-left loop-time safety-margin))
		     (when time-limit (- time-left loop-time safety-margin))))
	 ((or finished (when time-limit
			     (progn
			       (format t "~&~1,16@Tn = ~a, time left = ~a seconds" n time-left)
			       (<= quit-time 0))))
	  (when (not finished)
		(format t "~&~1,16@TRun out of time, exiting ... "))
	  (format t "~&~1,8@Tdone.")
	  (format t "~&n = ~a" n)
	  (if (not restart-name)
	      (progn
		(format t "~&checkpointing not requested, exiting ... ")
		(return-from rts nil))
	    (progn
	      (when restart
		    (let ((old-restart-name (concatenate 'string restart-name ".old")))
		      (format t "~&renaming old restart file to ~a" old-restart-name)
		      (rename-file restart-name old-restart-name)))
	      (format t "~&Saving data on ~a" restart-name)
	      (with-open-file (restart-file restart-name
					    :direction :output
					    :element-type 'string-char)
			      (if (not (streamp restart-file))
				  (progn
				    (format t "~&Error: cannot open ~a for writing"
					    restart-name)
				    (return-from rts 5))
				(format restart-file "~a~%" n)))
	      (if finished
		  (format t "~&FINISHED")
		(format t "~&CONTINUE")))))
		   
	 (sleep 5)
	 (setf n (1+ n))
	 (setf finished (> n last-n))
	 (when time-limit
	       (progn
		 (setf t2 (get-universal-time))
		 (setf loop-time (- t2 t1)))))))

I have compiled this function with compile-file (the compiled image has been saved on rts.fas) and here is how the program can be run on our system:

gustav@s1n01:~/src/resubmit/lisp 259 $ env | grep RSAVE
RSAVE_TIME_LIMIT=30
RSAVE_CHECKFILE=rts.dat
RSAVE_RESTART=yes
gustav@s1n01:~/src/resubmit/lisp 260 $ unset RSAVE_RESTART
gustav@s1n01:~/src/resubmit/lisp 261 $ lisp -q -i rts.fas << EOF
gustav@s1n01:~/src/resubmit/lisp 262 > (rts)
gustav@s1n01:~/src/resubmit/lisp 262 > EOF
;; Loading file rts.fas ...
;; Loading of file rts.fas is finished.
Time for this job limited to 30 seconds
Starting a new run
n = 0
        computing ... 
                n = 0, time left = 30 seconds
                n = 1, time left = 25 seconds
                n = 2, time left = 20 seconds
                n = 3, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 3
Saving data on rts.dat
CONTINUE
NIL
gustav@s1n01:~/src/resubmit/lisp 262 $ export RSAVE_RESTART=yes
gustav@s1n01:~/src/resubmit/lisp 263 $ lisp -q -i rts.fas << EOF
gustav@s1n01:~/src/resubmit/lisp 264 > (rts)
gustav@s1n01:~/src/resubmit/lisp 264 > EOF
;; Loading file rts.fas ...
;; Loading of file rts.fas is finished.
Time for this job limited to 30 seconds
Restarting the job from rts.dat
n = 3
        computing ... 
                n = 3, time left = 29 seconds
                n = 4, time left = 24 seconds
                n = 5, time left = 19 seconds
                n = 6, time left = 14 seconds
                Run out of time, exiting ... 
        done.
n = 6
renaming old restart file to rts.dat.old
Saving data on rts.dat
CONTINUE
NIL
gustav@s1n01:~/src/resubmit/lisp 264 $ 

   ...
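
The termination test of the do* loop can be restated as a small shell function, which may make the arithmetic easier to follow. This is only a sketch: the name should_quit and its argument order are made up, and the clock is simulated by adding 5 seconds per iteration instead of sleeping.

```shell
# should_quit TIME_LIMIT ELAPSED LOOP_TIME [SAFETY_MARGIN]
# Succeeds (exit status 0) when the job should stop and save its state.
# An empty TIME_LIMIT means "unlimited", mirroring the double role of
# time-limit in the Lisp listing. Plain sh has no local variables, so
# the names assigned below are global.
should_quit() {
    time_limit=$1
    elapsed=$2              # seconds since the job started (t2 - t0)
    loop_time=$3            # duration of the latest iteration (t2 - t1)
    safety_margin=${4:-10}

    [ -n "$time_limit" ] || return 1   # no limit: never quit on time

    time_left=$(( time_limit - elapsed ))
    quit_time=$(( time_left - loop_time - safety_margin ))
    [ "$quit_time" -le 0 ]
}

# Replay the 30-second run: every iteration is assumed to take 5 seconds.
n=0
clock=0
while ! should_quit 30 "$clock" 5; do
    n=$(( n + 1 ))
    clock=$(( clock + 5 ))  # stands in for (sleep 5)
done
echo "stopped at n = $n with $(( 30 - clock )) seconds left"
```

The replay stops at n = 3 with 15 seconds left, as in the transcripts: with 15 seconds remaining, one more 5-second iteration plus the 10-second safety margin would leave nothing to spare.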


Combining the Application with LoadLeveler: Automatic Resubmission

In this section I shall demonstrate how our toy application can be run under LoadLeveler, and how you can use its various features to automatically keep resubmitting the job until the whole computational task is finished.

What makes it particularly easy is the LoadLeveler's #@environment=COPY_ALL statement, which transfers all currently defined environmental variables to the submitted job. That way we can define, say, RSAVE_RESTART in the script, after the first, initialising run of the application, and rest assured that when the job is resubmitted, it will already read the data from the restart file.
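
The same kind of inheritance can be observed in plain shell: an exported variable is visible to every child process, while a variable that is merely assigned is not. The sketch below illustrates only that shell behaviour and does not invoke LoadLeveler; the variable RSAVE_DEMO_LOCAL is made up for the contrast, and the child shell merely stands in for the next instantiation of the job.

```shell
# Start from a clean slate so that the contrast is reproducible.
unset RSAVE_DEMO_LOCAL

# A variable that is assigned but not exported stays in this shell.
RSAVE_DEMO_LOCAL=yes

# An exported variable is copied into the environment of every child.
export RSAVE_RESTART=yes

# Ask a child shell what it can see.
child_view=$(sh -c 'echo "restart=${RSAVE_RESTART:-unset} local=${RSAVE_DEMO_LOCAL:-unset}"')
echo "$child_view"
```

The child reports restart=yes local=unset. This is why the script has to export RSAVE_RESTART and RSAVE_STEP rather than just assign them.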

The LoadLeveler script begins by running program ./rts: that is our application. The output is saved on rts.log:

./rts > rts.log

Both C and Fortran examples are invoked in the same way. In order to run the Lisp example replace that line with

lisp -q -i rts.fas << EOF > rts.log
(rts)
EOF

Naturally, the definition of #@initialdir must correspond to where all the binaries and data files live!

After the job exits, the script performs a number of quite interesting manipulations. First of all, it checks if an environmental variable RSAVE_STEP exists. That variable is used to number our LoadLeveler runs. It is quite like LoadLeveler's variable $(stepid), with the difference that here we do it all ourselves. If the variable exists, it means that this particular run was already a resubmission. In that case the value of RSAVE_STEP is incremented and the old restart file, say, rts.dat.old, is renamed to something like rts.dat.3, where 3 is the RSAVE_STEP number. That way we keep the log of the whole computation. In a more complex application, the rts.dat files could contain images or three dimensional data sets, which, if saved, could be used to produce an animation or a CAVE display.

If the variable RSAVE_STEP does not exist, it means that this is the initialising run. In that case the variable is created and assigned number 0. Because we export it, it will become available to the next instantiation of the job.

The log file, rts.log, is also saved on something like, say, rts.log.3, where 3 is the RSAVE_STEP number. Observe that rts.log.3 corresponds to the run that used rts.dat.3 as its restart file.

After these manipulations we inspect the log file itself and check if it contains the word CONTINUE. If it does, we check if the variable RSAVE_RESTART exists. If it doesn't, it means that this was the first, initialising run. So we create that variable. Once created it will become available to the next instantiation of the job via the #@environment=COPY_ALL mechanism. Either way the job is resubmitted with the command

llsubmit $LOADL_STEP_COMMAND

where $LOADL_STEP_COMMAND evaluates to the name of the LoadLeveler script itself.

If the word CONTINUE has not been found in the log file, then we check if the log file contains the word FINISHED. If the job is FINISHED it is not resubmitted. Instead a mail message is sent to whoever submitted the job in the first place ($LOADL_STEP_OWNER), informing the addressee that the job has been completed.

If neither the word CONTINUE nor the word FINISHED has been found in the log file, it means that an error condition must have occurred and the job exited mid-way. In that case, the job is not resubmitted and a mail message informing about the error is sent to the $LOADL_STEP_OWNER.
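
The three-way decision can also be written as a small shell function. This is a sketch of a slightly stricter variant, not a replacement for the script: classify_log is a made-up name, the keyword is accepted only when it stands alone on a line (apart from blanks), and grep -q is used so that nothing is echoed to the job's output.

```shell
# classify_log FILE
# Prints one of: continue, finished, error.
# A missing or unreadable file is treated as an error exit.
classify_log() {
    if grep -q '^[[:space:]]*CONTINUE[[:space:]]*$' "$1" 2> /dev/null; then
        echo continue
    elif grep -q '^[[:space:]]*FINISHED[[:space:]]*$' "$1" 2> /dev/null; then
        echo finished
    else
        echo error
    fi
}

# Example: a log that ends the way the Fortran program's output does.
sample_log=$(mktemp)
printf '%s\n' ' n =       3' ' CONTINUE' 'STOP 0' > "$sample_log"
echo "verdict: $(classify_log "$sample_log")"
rm -f "$sample_log"
```

Matching the whole line guards against a message such as "Error: cannot CONTINUE" being mistaken for the marker, at the price of assuming that the application prints the marker on a line of its own, as all three toy programs do.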

Here is the whole LoadLeveler script in full glory:

# @ shell = /opt/gnu/bin/bash
# @ environment = COPY_ALL
# @ job_name = rts
# @ output = $(job_name).$(jobid).out
# @ error = $(job_name).$(jobid).err
# @ class = half_hour
# @ notification = always
# @ initialdir = /home/qpsf/gustav/src/resubmit/c
# @ queue
cd $LOADL_STEP_INITDIR
#
# Execute this step.
#
./rts > rts.log
#
# If there is $RSAVE_CHECKFILE.old file then
# replace the suffix ".old" with a step number.
#
if [ -n "${RSAVE_STEP}" ]
then
   export RSAVE_STEP=`expr $RSAVE_STEP + 1`
   if [ -n "${RSAVE_CHECKFILE}" ]
   then
      if [ -f $RSAVE_CHECKFILE.old ]
      then
         mv $RSAVE_CHECKFILE.old $RSAVE_CHECKFILE.$RSAVE_STEP
      fi
   fi
else
   export RSAVE_STEP=0
fi
# 
# also save the log of this run 
#
cp rts.log rts.log.$RSAVE_STEP
#
# Check if the job is finished and if it is not
# resubmit this file
#
if grep CONTINUE rts.log
then
   if [ -z "${RSAVE_RESTART}" ]
   then
      export RSAVE_RESTART=yes
   fi
   llsubmit $LOADL_STEP_COMMAND
elif grep FINISHED rts.log
then
   mailx $LOADL_STEP_OWNER << EOF
Your job rts has FINISHED
EOF
else
   mailx $LOADL_STEP_OWNER << EOF
rts: error exit, check the log file
EOF
fi
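
To try out the bookkeeping without access to LoadLeveler, the sketch below replaces the application with a stand-in shell function and replaces llsubmit with a plain loop. Everything in it (fake_rts, the scratch directory, the rule of three steps per run) is invented for the illustration; only the file-naming scheme follows the script above.

```shell
# Work in a scratch directory; remove $workdir by hand when done.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
unset RSAVE_STEP RSAVE_RESTART
RSAVE_CHECKFILE=rts.dat

# Stand-in for ./rts: advances n by at most 3 per run, finishes past 30.
fake_rts() {
    n=0
    if [ -n "$RSAVE_RESTART" ]; then
        n=$(cat "$RSAVE_CHECKFILE")
        mv "$RSAVE_CHECKFILE" "$RSAVE_CHECKFILE.old"
    fi
    i=0
    while [ "$i" -lt 3 ] && [ "$n" -le 30 ]; do
        n=$(( n + 1 ))
        i=$(( i + 1 ))
    done
    echo "$n" > "$RSAVE_CHECKFILE"
    if [ "$n" -gt 30 ]; then echo FINISHED; else echo CONTINUE; fi
}

# Each pass of this loop plays the role of one LoadLeveler job.
while :; do
    fake_rts > rts.log
    if [ -n "$RSAVE_STEP" ]; then
        RSAVE_STEP=$(( RSAVE_STEP + 1 ))
        if [ -f "$RSAVE_CHECKFILE.old" ]; then
            mv "$RSAVE_CHECKFILE.old" "$RSAVE_CHECKFILE.$RSAVE_STEP"
        fi
    else
        RSAVE_STEP=0
    fi
    cp rts.log "rts.log.$RSAVE_STEP"
    if grep CONTINUE rts.log > /dev/null; then
        RSAVE_RESTART=yes   # the real script would call llsubmit here
    else
        break
    fi
done

echo "steps run: $(( RSAVE_STEP + 1 ))"
ls
```

The loop runs 11 steps, numbered 0 to 10, and leaves behind rts.dat, rts.dat.1 to rts.dat.10, rts.log, and rts.log.0 to rts.log.10. There is no rts.dat.0, because the initialising run has no old restart file to rename.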

Here is how this script is submitted and what happens afterwards.

gustav@s1n01:~/src/resubmit/c 615 $ env | grep RSAVE
RSAVE_TIME_LIMIT=30
RSAVE_CHECKFILE=rts.dat
gustav@s1n01:~/src/resubmit/c 615 $ llsubmit rts.ll
submit: The job "s1n01.16643" has been submitted.
gustav@s1n01:~/src/resubmit/c 617 $ 

Observe that only RSAVE_TIME_LIMIT and RSAVE_CHECKFILE have been defined. All other variables will be defined by the LoadLeveler script as they become needed.

The job runs happily, resubmitting itself every time the program rts exits and producing numerous log and data files:

gustav@s1n01:~/src/resubmit/c 626 $ ls rts*
  12 rts*              4 rts.9.out         4 rts.dat.5         4 rts.log.10
   4 rts.10.out        4 rts.c             4 rts.dat.6         4 rts.log.2
   4 rts.104.out       4 rts.c~            4 rts.dat.7         4 rts.log.3
   0 rts.11.out        4 rts.dat           4 rts.dat.8         4 rts.log.4
   4 rts.12.out        4 rts.dat.1         4 rts.dat.9         4 rts.log.5
   4 rts.16643.out     4 rts.dat.10        4 rts.ll*           4 rts.log.6
   4 rts.18.out        4 rts.dat.2         4 rts.log           4 rts.log.7
   4 rts.19.out        4 rts.dat.3         4 rts.log.0         4 rts.log.8
   4 rts.7.out         4 rts.dat.4         4 rts.log.1         4 rts.log.9
gustav@s1n01:~/src/resubmit/c 627 $ 

The rts.dat.* files contain the evolution (or animation) of the system:

gustav@s1n01:~/src/resubmit/c 630 $ cat `/bin/ls -t rts.dat.*`
30
27
24
21
18
15
12
9
6
3
gustav@s1n01:~/src/resubmit/c 631 $ 
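
Ordering by modification time works here because the files were created one after another. If the files are later copied or restored from a backup, their time stamps may no longer reflect the step order. A possible alternative, sketched below, is to sort on the numeric suffix instead; list_steps is a made-up helper, and the scratch files merely imitate a few of the restart files above.

```shell
# list_steps PREFIX
# Prints the files named PREFIX.<number> in increasing step order.
list_steps() {
    for f in "$1".*; do
        step=${f##*.}                 # text after the last dot
        case $step in
            ''|*[!0-9]*) ;;           # skip rts.dat.old and the like
            *) printf '%s %s\n' "$step" "$f" ;;
        esac
    done | sort -n | cut -d ' ' -f 2-
}

# Imitate a few restart files in a scratch directory.
scratch=$(mktemp -d)
for s in 1 2 3 10; do
    echo $(( s * 3 )) > "$scratch/rts.dat.$s"
done
echo 30 > "$scratch/rts.dat.old"

# Print their contents in step order, oldest first.
list_steps "$scratch/rts.dat" | while read -r f; do
    cat "$f"
done
rm -rf "$scratch"
```

A plain alphabetical listing would put rts.dat.10 before rts.dat.2; the numeric sort avoids that. Pipe the result through sort -r style reversal, or use sort -rn inside the function, if the newest step is wanted first.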

The rts.log.* files contain the log of the whole computation:

gustav@s1n01:~/src/resubmit/c 631 $ cat `/bin/ls -t rts.log.*`
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 30
        computing ... 
                n = 31, time left = 25 seconds
        done.
n = 31
renaming old restart file to rts.dat.old
saving data on rts.dat
FINISHED
Time for this job limited to 30 seconds.
Restarting the job from rts.dat.
n = 27
        computing ... 
                n = 28, time left = 25 seconds
                n = 29, time left = 20 seconds
                n = 30, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 30
renaming old restart file to rts.dat.old
saving data on rts.dat
CONTINUE

   ...

Time for this job limited to 30 seconds.
Starting a new run.
n = 0
        computing ... 
                n = 1, time left = 25 seconds
                n = 2, time left = 20 seconds
                n = 3, time left = 15 seconds
                Run out of time, exiting ... 
        done.
n = 3
saving data on rts.dat
CONTINUE
gustav@s1n01:~/src/resubmit/c 632 $ 

And the rts.*.out files contain messages from the LoadLeveler script in its various instantiations:

gustav@s1n01:~/src/resubmit/c 633 $ cat `/bin/ls -t rts.*.out`
CONTINUE
submit: The job "s1n08.11" has been submitted.
CONTINUE
submit: The job "s1n09.19" has been submitted.
CONTINUE
submit: The job "s1n10.12" has been submitted.
CONTINUE
submit: The job "s1n08.10" has been submitted.
CONTINUE
submit: The job "s1n05.7" has been submitted.
CONTINUE
submit: The job "s1n03.11" has been submitted.
CONTINUE
submit: The job "s1n06.11" has been submitted.
CONTINUE
submit: The job "s1n07.104" has been submitted.
gustav@s1n01:~/src/resubmit/c 634 $ 

Every time a new job is submitted from the LoadLeveler script, a message is sent to me, for example:

From:	Zdzislaw Meglicki 
To:	gustav@qpsf.edu.au
Subject: rts
Date:	Mon, 30 Dec 1996 16:17:14 +1000

From: LoadLeveler@s1n08.qpsf.edu.au

LoadLeveler Job: rts
Your job step, " s1n09.qpsf.edu.au.19.0" has started.

Starter information:

         Submitted: Mon Dec 30 16:17:11 1996

        Executable: rts.ll
     Job Step Type: NonParallel
       MachineName: s1n08.qpsf.edu.au
       Architecure: R6000
  Operating System: AIX41

And every time a given instantiation of the job exits another message comes in:

From:	LoadLeveler 
To:	gustav@qpsf.edu.au
Subject: rts
Date:	Mon, 30 Dec 1996 16:17:32 +1000

From: LoadLeveler

Your LoadLeveler job step
	 s1n09.qpsf.edu.au.19.0 (rts.ll )
has exited.

Status for machine s1n08.qpsf.edu.au:
	The job step exited normally with code 0

This job step was dispatched to run 1 time.
This job step was rejected by a starter 0 times..

Submitted at: Mon Dec 30 16:17:11 1996
Exited    at: Mon Dec 30 16:17:31 1996

               Real Time:   0 00:00:20
      Job Step User Time:   0 00:00:00
    Job Step System Time:   0 00:00:00
     Total Job Step Time:   0 00:00:00

       Starter User Time:   0 00:00:00
     Starter System Time:   0 00:00:00
      Total Starter Time:   0 00:00:00

If you run a long job, which resubmits itself twice or perhaps only once a day, those messages keep you informed about the progress of the computation.


For help and programming or academic assistance e-mail gustav@indiana.edu
Please e-mail any feedback related to this document to webmaster@beige.ucs.indiana.edu

[DocId:resubmit.html, Version:1.18, Date:98/07/10 14:45:50]