Next: Specification of LoadLeveler Jobs Up: Working with LoadLeveler Previous: LoadLeveler Configuration

Submitting, Inspecting, and Cancelling LoadLeveler Jobs

So, how does one pass a job on to LoadLeveler for execution, and having passed it, how does one check what's going on with that job, and having checked and changed one's mind, how does one cancel the job?

One prepares a job by creating a job description file. A LoadLeveler job description file comprises a number of LoadLeveler directives, and possibly also a shell or a Perl script that follows the directives.

LoadLeveler directives are a little like High Performance Fortran directives. To the shell or Perl interpreter that LoadLeveler may invoke to run the script, the directives look like comments, so the interpreter ignores them. But LoadLeveler reads the directives and performs various additional actions as instructed.
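A quick way to convince yourself of this is to hand such a file to a shell directly, bypassing LoadLeveler. The sketch below is only an illustration: the function name run_as_plain_script and the file name demo.out are made up, and the job text is passed on standard input rather than read from a file.

```shell
# Illustration only: run a LoadLeveler-style job description through
# plain sh. The "#@" lines are comments to the shell, so only echo runs.
# (run_as_plain_script and demo.out are hypothetical names.)
run_as_plain_script() {
    sh <<'EOF'
#@ output = demo.out
#@ class  = test
#@ queue
echo hello world
EOF
}

run_as_plain_script
```

The shell prints hello world on standard output and creates no demo.out, because it never looks inside the comment lines. Only LoadLeveler acts on them.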

Here is an example of a LoadLeveler job description file:

gustav@sp20:../LoadLeveler 20:06:15 !577 $ cat echo.ll
#@ output      = echo.out
#@ error       = echo.err
#@ class       = test
#@ environment = COPY_ALL
#@ executable  = /afs/ovpit.indiana.edu/@sys/gnu/bin/echo
#@ arguments   = hello world
#@ queue
gustav@sp20:../LoadLeveler 20:06:17 !578 $

LoadLeveler directives begin with #@, which looks like a comment to all IEEE-1003.2 compliant shells and to Perl. Unfortunately, neither Common Lisp nor Scheme interprets # as a comment character, so it would be nice if the LoadLeveler directive flag could be changed. The directive flag, #@, must be followed by a keyword, such as output or class, and this, in turn, may be followed by additional parameters if the keyword requires them.
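To make the flag, keyword, and parameter structure concrete, here is a small sketch that pulls the directives out of a job description file and prints them as keyword=value pairs. It is not part of LoadLeveler: the function name list_directives is hypothetical, and the sketch assumes that directives look exactly like the ones in the example above, that is, #@ at the very start of the line followed by a keyword and an optional = and value.

```shell
# Hypothetical helper: read a job description on standard input and
# print one "keyword=value" line per "#@" directive.
list_directives() {
    awk '
        /^#@/ {
            line = $0
            sub(/^#@[ \t]*/, "", line)        # drop the directive flag
            eq = index(line, "=")
            if (eq == 0) {                     # keyword only, e.g. "queue"
                key = line
                val = ""
            } else {
                key = substr(line, 1, eq - 1)
                val = substr(line, eq + 1)
            }
            gsub(/^[ \t]+|[ \t]+$/, "", key)   # trim surrounding blanks
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            print key "=" val
        }'
}

# Example: three directives in the style of echo.ll
list_directives <<'EOF'
#@ output      = echo.out
#@ class       = test
#@ queue
EOF
```

For the three lines above this prints output=echo.out, class=test, and queue= (the last with an empty value, since queue takes no parameters).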

In the simplest situations you would specify an executable to be run by LoadLeveler by something like:

#@ executable  = /afs/ovpit.indiana.edu/@sys/gnu/bin/echo
and if additional command line arguments need to be used, you would specify them by something like:
#@ arguments   = hello world

The job itself is queued by the directive:

#@ queue
This keyword must appear in the LoadLeveler job description file at least once.
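Since a missing queue directive is easy to overlook, one might check for it before submitting. The following is a minimal sketch, not a LoadLeveler facility: has_queue_directive is a hypothetical name, and the pattern assumes the directive is written as in the examples here, with #@ at the start of the line.

```shell
# Hypothetical pre-submission check: succeed if the job description
# read from standard input contains at least one "#@ queue" line.
has_queue_directive() {
    grep -Eq '^#@[[:space:]]*queue[[:space:]]*$'
}

# Example with a two-line job description
if printf '%s\n' '#@ class = test' '#@ queue' | has_queue_directive; then
    echo "queue directive found"
else
    echo "no queue directive"
fi
```

With a real file you would redirect it into the function, for example has_queue_directive < echo.ll.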

Having prepared the job description file, submit it with the command llsubmit:

gustav@sp20:../LoadLeveler 20:10:16 !583 $ ls
echo.ll
gustav@sp20:../LoadLeveler 20:18:48 !584 $ llsubmit echo.ll
submit: The job "sp20.26" has been submitted.
gustav@sp20:../LoadLeveler 20:18:54 !585 $ ls
echo.ll   echo.out
gustav@sp20:../LoadLeveler 20:18:59 !586 $ cat echo.out
hello world
gustav@sp20:../LoadLeveler 20:19:02 !587 $
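The job ID reported by llsubmit is what the other commands take as an argument, so it can be handy to capture it in a script. Here is a sketch that extracts the ID from the message; it assumes the message has exactly the form shown in the transcript above, which may differ in other LoadLeveler versions, and the function name job_id_from_submit_message is made up.

```shell
# Hypothetical helper: extract the job ID from an llsubmit message of
# the form shown above, read from standard input.
job_id_from_submit_message() {
    sed -n 's/^submit: The job "\([^"]*\)" has been submitted\.$/\1/p'
}

# Example, using the message from the transcript
echo 'submit: The job "sp20.26" has been submitted.' | job_id_from_submit_message
```

The example prints sp20.26. In a real session one might write id=$(llsubmit echo.ll | job_id_from_submit_message), assuming the message arrives on standard output.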

Assuming that the job executed correctly and without errors or diagnostics, the output will be left in whatever file you have specified with the

#@ output      = echo.out
directive.

In this case the job has been sent to class test following the directive:

#@ class       = test

The job may not run for a while, depending on what other jobs there are in the system, queue priorities, job priorities, etc. In general you cannot predict exactly which node the job will run on. In this case it may run on any node that supports the test class.

When the job completes, you will find file echo.out in your working directory and inside the file the two magic words:

gustav@sp20:../LoadLeveler 20:20:57 !588 $ cat echo.out
hello world
gustav@sp20:../LoadLeveler 20:23:20 !589 $

You can accomplish a similar result by submitting the following LoadLeveler job description file:

gustav@sp20:../LoadLeveler 20:26:29 !600 $ cat echo-2.ll
#@ output      = echo-2.out
#@ error       = echo-2.err
#@ class       = test
#@ environment = COPY_ALL
#@ shell       = /afs/ovpit.indiana.edu/@sys/gnu/bin/bash
#@ queue
echo hello world
gustav@sp20:../LoadLeveler 20:26:33 !601 $
This time we tell LoadLeveler that the job description file is a shell script and that it should invoke
/afs/ovpit.indiana.edu/@sys/gnu/bin/bash
to interpret it. The script itself contains just:
echo hello world
and, indeed, when the job completes, you will find a file echo-2.out in your working directory with the following words in it:
gustav@sp20:../LoadLeveler 20:26:33 !601 $ cat echo-2.out
hello world
gustav@sp20:../LoadLeveler 20:28:28 !602 $

There is a significant difference between the two runs. In the first case we run the stand-alone echo binary from

/afs/ovpit.indiana.edu/@sys/gnu/bin
In the second case the binary is not echo but bash, and bash's built-in echo is used to print hello world on standard output.

This job is far too short to be caught in the queue, unless there are no spare slots left and the job has to wait. But assuming that you had a long job, or that the queue was fully occupied, how would you check what is happening to your job?

The command is llq. Here's an example of how it works:

gustav@sp20:../LoadLeveler 20:30:48 !603 $ llq
Id                       Owner      Submitted   ST PRI Class        Running On 
------------------------ ---------- ----------- -- --- ------------ -----------
sp01.71.0                rshoward    1/9  08:04 R  50  b            sp01       
sp01.82.0                rshoward    1/10 08:27 R  50  b            sp02       
sp01.5.22                wfischer    1/2  08:03 R  50  pb           sp06       
sp01.5.21                wfischer    1/2  08:03 R  50  pb           sp11       
sp02.1974.0              eisenste    1/11 03:57 R  50  b            sp13       
sp05.1150.0              kang        1/8  16:06 R  50  b            sp27       
libra.1849.0             kapihaka    1/10 16:20 R  50  b            sp28       
libra.1838.0             tachim      1/5  16:48 R  50  b            sp32       
sp02.2000.0              eisenste    1/12 03:24 R  50  b            sp33       
sp01.5.20                wfischer    1/2  08:03 R  50  pb           sp34       
sp01.5.19                wfischer    1/2  08:03 R  50  pb           sp35       
sp02.1953.0              eisenste    1/8  02:57 R  50  b            sp36       
sp01.134.0               tghanty     1/12 20:27 R  50  a            sp40       
sp01.128.0               tghanty     1/12 12:56 R  50  a            sp41       
sp01.133.0               tghanty     1/12 19:48 R  50  a            sp43       
sp02.1955.0              eisenste    1/8  03:01 R  50  b            sp46       
sp02.2001.0              eisenste    1/12 03:26 NQ 50  b                       
sp01.5.23                wfischer    1/2  08:03 NQ 50  pb                      
sp01.5.24                wfischer    1/2  08:03 NQ 50  pb                      
sp01.5.25                wfischer    1/2  08:03 NQ 50  pb                      
sp01.5.26                wfischer    1/2  08:03 NQ 50  pb                      
sp01.5.27                wfischer    1/2  08:03 NQ 50  pb                      
sp01.69.0                wfischer    1/8  21:41 NQ 50  pb                      
sp01.69.1                wfischer    1/8  21:41 NQ 50  pb                      
sp01.69.2                wfischer    1/8  21:41 NQ 50  pb                      
sp01.69.3                wfischer    1/8  21:41 NQ 50  pb                      
sp01.5.0                 wfischer    1/2  08:03 C  50  pb                      
sp01.5.1                 wfischer    1/2  08:03 C  50  pb                      
sp01.5.2                 wfischer    1/2  08:03 C  50  pb                      
sp01.5.3                 wfischer    1/2  08:03 C  50  pb                      
sp01.5.4                 wfischer    1/2  08:03 C  50  pb                      
sp01.5.5                 wfischer    1/2  08:03 RM 50  pb                      
sp01.5.6                 wfischer    1/2  08:03 C  50  pb                      
sp01.5.7                 wfischer    1/2  08:03 C  50  pb                      
sp01.5.8                 wfischer    1/2  08:03 C  50  pb                      
sp01.5.9                 wfischer    1/2  08:03 C  50  pb                      
sp01.5.10                wfischer    1/2  08:03 C  50  pb                      
sp01.5.11                wfischer    1/2  08:03 C  50  pb                      
sp01.5.12                wfischer    1/2  08:03 C  50  pb                      
sp01.5.13                wfischer    1/2  08:03 C  50  pb                      
sp01.5.14                wfischer    1/2  08:03 C  50  pb                      
sp01.5.15                wfischer    1/2  08:03 C  50  pb                      
sp01.5.16                wfischer    1/2  08:03 C  50  pb                      
sp01.5.17                wfischer    1/2  08:03 C  50  pb                      
sp01.5.18                wfischer    1/2  08:03 C  50  pb                      

26 jobs in queue 0 waiting, 0 pending, 16 running, 10 held.
gustav@sp20:../LoadLeveler 20:30:58 !604 $
This listing tells us a lot of things. For example, it tells us that Mr Will Fischer is hogging the system and that Mary Papakhian should ever so gently infuse some sanity into him. It also tells us that there is little point in submitting any jobs to the pb class, because the queue is clogged with Mr Fischer's jobs.

The command llq supports various options. For example, to list only jobs in class b, type:

gustav@sp20:../LoadLeveler 20:40:37 !614 $ llq -c b
Id                       Owner      Submitted   ST PRI Class        Running On 
------------------------ ---------- ----------- -- --- ------------ -----------
sp01.71.0                rshoward    1/9  08:04 R  50  b            sp01       
sp01.82.0                rshoward    1/10 08:27 R  50  b            sp02       
sp02.1974.0              eisenste    1/11 03:57 R  50  b            sp13       
sp05.1150.0              kang        1/8  16:06 R  50  b            sp27       
libra.1849.0             kapihaka    1/10 16:20 R  50  b            sp28       
libra.1838.0             tachim      1/5  16:48 R  50  b            sp32       
sp02.2000.0              eisenste    1/12 03:24 R  50  b            sp33       
sp02.1953.0              eisenste    1/8  02:57 R  50  b            sp36       
sp02.1955.0              eisenste    1/8  03:01 R  50  b            sp46       
sp02.2001.0              eisenste    1/12 03:26 NQ 50  b                       

10 jobs in queue 0 waiting, 0 pending, 9 running, 1 held.
gustav@sp20:../LoadLeveler 20:40:59 !615 $
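A long listing is easier to digest when summarised. The sketch below counts jobs per status code with awk. It assumes the column layout shown in the listings above (job ID first, status in the fifth whitespace-separated field) and recognises job lines by an ID that ends in two dot-separated numbers. The function name count_by_status is hypothetical.

```shell
# Hypothetical helper: read llq output on standard input and print
# "status count" for each status code, sorted by status.
count_by_status() {
    awk '$1 ~ /\.[0-9]+\.[0-9]+$/ { n[$5]++ }
         END { for (s in n) print s, n[s] }' | sort
}

# Example, using three job lines from the listing above
count_by_status <<'EOF'
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
sp01.71.0                rshoward    1/9  08:04 R  50  b            sp01
sp02.1955.0              eisenste    1/8  03:01 R  50  b            sp46
sp02.2001.0              eisenste    1/12 03:26 NQ 50  b
EOF
```

For this sample the function prints NQ 1 and R 2. On a live system one would run llq | count_by_status instead of feeding it a here-document.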

The listing tells us who submitted each job and when, what the job's priority is, which class it has been submitted to, and which node it runs on, if it runs at all. It also shows the job ID and the job's current status. The status is one of the following:

C   The job has completed.
D   The job has been deferred.
H   A user put a hold on the job.
I   No machine has been selected for the job yet. The job is idle.
NQ  The job is not currently considered to run on any machine. It has not been queued.
NR  The job is badly formulated: there are unresolvable dependencies on other jobs in it, so it will never run.
P   The job is in the process of starting on one or more machines: it is pending.
R   The job is running.
RM  The job was removed (cancelled) either by the user or by LoadLeveler.
RP  The job is in the process of being removed: not all machines have responded yet, so the removal is pending.
S   The LoadLeveler administrator, i.e., Mary Papakhian, has put the job on hold. This is called system hold.
SH  Both the LoadLeveler administrator and the user have put the job on hold.
ST  The job has been dispatched and received by a target machine. LoadLeveler is setting up the environment for the job. The job is said to be starting.
V   The job did not complete for some reason. It has been vacated.
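If you process llq output in scripts, a small lookup that turns the codes above into words can make reports more readable. The following is a sketch only: describe_status is a hypothetical name, and the short labels are paraphrases of the descriptions above, not official LoadLeveler terms.

```shell
# Hypothetical helper: print a short label for an llq status code.
# The labels paraphrase the list above.
describe_status() {
    case "$1" in
        C)  echo "completed" ;;
        D)  echo "deferred" ;;
        H)  echo "user hold" ;;
        I)  echo "idle" ;;
        NQ) echo "not queued" ;;
        NR) echo "will never run" ;;
        P)  echo "pending" ;;
        R)  echo "running" ;;
        RM) echo "removed" ;;
        RP) echo "removal pending" ;;
        S)  echo "system hold" ;;
        SH) echo "system and user hold" ;;
        ST) echo "starting" ;;
        V)  echo "vacated" ;;
        *)  echo "unknown status: $1"; return 1 ;;
    esac
}

describe_status NQ
```

The call at the end prints not queued. An unrecognised code produces a message and a non-zero return status, so a calling script can detect it.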

How does one put a hold on a job? One issues the command llhold giving it the job ID as an argument:

gustav@sp20:../LoadLeveler 20:52:18 !626 $ llhold sp01.5.23
hold: Hold command has been sent to central manager on "sp01.ucs.indiana.edu"
gustav@sp20:../LoadLeveler 20:52:34 !627 $
It's not going to work here, because it's not my job.

To release a job from hold, type:

gustav@sp20:../LoadLeveler 20:54:38 !635 $ llhold -r sp01.5.23
hold: Hold command has been sent to central manager on "sp01.ucs.indiana.edu"
gustav@sp20:../LoadLeveler 20:54:54 !636 $

It may happen that you want to cancel the job altogether. The command to do that is llcancel, e.g.,

gustav@sp20:../LoadLeveler 20:55:42 !639 $ llcancel sp01.5.20 
llcancel: Cancel command has been sent to central manager on "sp01.ucs.indiana.edu"
gustav@sp20:../LoadLeveler 20:56:23 !640 $
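As noted above, a hold request for somebody else's job is not going to work, so before holding or cancelling it makes sense to pick out the IDs of your own jobs. Here is a sketch that filters an llq listing by owner. The function name jobs_owned_by is hypothetical, the column layout is assumed to be the one shown in the listings above, and the owner is compared against the Owner column exactly as llq prints it (in the listings some names appear to be cut to eight characters).

```shell
# Hypothetical helper: read llq output on standard input and print the
# IDs of jobs whose Owner column equals the first argument.
jobs_owned_by() {
    awk -v owner="$1" '$1 ~ /\.[0-9]+\.[0-9]+$/ && $2 == owner { print $1 }'
}

# Example, using lines from the listing above
jobs_owned_by eisenste <<'EOF'
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
sp01.71.0                rshoward    1/9  08:04 R  50  b            sp01
sp02.1955.0              eisenste    1/8  03:01 R  50  b            sp46
sp02.2001.0              eisenste    1/12 03:26 NQ 50  b
EOF
```

For this sample the function prints sp02.1955.0 and sp02.2001.0. Each ID could then be passed to llhold or llcancel, one at a time, as in the transcripts above.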


Zdzislaw Meglicki
2001-02-26