So, how does one pass a job on to LoadLeveler for execution? Having passed it on, how does one check what's going on with that job? And having checked and changed one's mind, how does one cancel the job?
One prepares a job by creating a job description file. A LoadLeveler job description file comprises a number of LoadLeveler directives, and possibly also a shell or a Perl script that follows the directives.
LoadLeveler directives are a little like High Performance Fortran directives. To a shell or to Perl, either of which LoadLeveler may invoke to interpret the script, they look like comments and are ignored. But LoadLeveler reads the directives and performs various additional actions as instructed.
Here is an example of a LoadLeveler job description file:
gustav@sp20:../LoadLeveler 20:06:15 !577 $ cat echo.ll
#@ output = echo.out
#@ error = echo.err
#@ class = test
#@ environment = COPY_ALL
#@ executable = /afs/ovpit.indiana.edu/@sys/gnu/bin/echo
#@ arguments = hello world
#@ queue
gustav@sp20:../LoadLeveler 20:06:17 !578 $
LoadLeveler directives begin with #@, which to
all IEEE-1003.2 compliant shells and to Perl looks like
a comment. Unfortunately, neither Common Lisp
nor Scheme interprets # as a comment character.
It would be nice if the LoadLeveler directive flag could
be changed.
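The claim that a shell treats #@ lines as comments is easy to check without LoadLeveler at all. The following is only a sketch: the scratch file made with mktemp is an assumption of the example, and the directive values are simply copied from echo.ll, with an echo command added as the script body.

```shell
# Write a small job description file to a scratch location (hypothetical file).
job=$(mktemp)
cat > "$job" <<'EOF'
#@ output = echo.out
#@ class = test
#@ queue
echo hello world
EOF

# Run the file with a plain shell, with no LoadLeveler involved.
# Every line that starts with #@ is skipped as a comment,
# so only the echo command is executed.
result=$(sh "$job")
echo "$result"

rm -f "$job"
```

The shell prints hello world and nothing else, which is why the same file can serve both as a set of LoadLeveler directives and as a script.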
The LoadLeveler directive flag, #@,
must be followed by a keyword, such as
output or class, and this, in turn, may be followed
by additional parameters, if that is required by the keyword.
In the simplest situations you would specify an executable to be run by LoadLeveler with something like:
#@ executable = /afs/ovpit.indiana.edu/@sys/gnu/bin/echo
and if additional command line arguments are needed, you would specify them with something like:
#@ arguments = hello world
The job itself is queued by the directive:
#@ queue
This keyword must appear in the LoadLeveler job description file at least once.
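Because a job description file without a queue directive queues nothing, it may be worth checking for one before submitting. The function below is a sketch, not a LoadLeveler tool: has_queue is a made-up name, the scratch files are hypothetical, and the pattern assumes the directive is spelled exactly as in this section (#@ followed by the lower-case word queue). LoadLeveler itself may accept other spellings.

```shell
# has_queue FILE: succeed if FILE contains at least one "#@ queue" line.
# Assumption: the directive is written as "#@ queue", as in this section.
has_queue() {
    grep -Eq '^#@[[:space:]]*queue[[:space:]]*$' "$1"
}

# Two hypothetical scratch files: one with the directive, one without.
good=$(mktemp)
bad=$(mktemp)
printf '%s\n' '#@ output = echo.out' '#@ class = test' '#@ queue' > "$good"
printf '%s\n' '#@ output = echo.out' '#@ class = test' > "$bad"

if has_queue "$good"; then good_result=ok; else good_result=missing; fi
if has_queue "$bad"; then bad_result=ok; else bad_result=missing; fi
echo "$good_result $bad_result"

rm -f "$good" "$bad"
```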
Having prepared the job description file, submit it with
the command llsubmit:
gustav@sp20:../LoadLeveler 20:10:16 !583 $ ls
echo.ll
gustav@sp20:../LoadLeveler 20:18:48 !584 $ llsubmit echo.ll
submit: The job "sp20.26" has been submitted.
gustav@sp20:../LoadLeveler 20:18:54 !585 $ ls
echo.ll echo.out
gustav@sp20:../LoadLeveler 20:18:59 !586 $ cat echo.out
hello world
gustav@sp20:../LoadLeveler 20:19:02 !587 $
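The job name that llsubmit reports is what the other commands refer to later, so a script may want to pick it out of the message. A minimal sketch follows, assuming the message always has the form shown in the session above. The message is stored in a variable here so that the example runs without LoadLeveler; on a real system it would come from the llsubmit command itself.

```shell
# The message text is copied from the llsubmit session shown above.
msg='submit: The job "sp20.26" has been submitted.'

# Keep only what stands between the double quotes.
job_id=$(printf '%s\n' "$msg" | sed -n 's/.*The job "\([^"]*\)".*/\1/p')
echo "$job_id"
```

This prints sp20.26.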
Assuming that the job executed correctly and without errors or diagnostics, the output will be left in whatever file you have specified with the
#@ output = echo.out
directive.
In this case the job has been sent to class test following
the directive:
#@ class = test
The job may not run for a while, depending on what other jobs there
are in the system, queue priorities, job priorities, etc. In general
you cannot predict exactly which node the job will run on. In this
case it may run on any node that supports the test class.
When the job completes, you will find file echo.out in your
working directory and inside the file the two magic words:
gustav@sp20:../LoadLeveler 20:20:57 !588 $ cat echo.out
hello world
gustav@sp20:../LoadLeveler 20:23:20 !589 $
You can accomplish a similar result by submitting the following LoadLeveler job description file:
gustav@sp20:../LoadLeveler 20:26:29 !600 $ cat echo-2.ll
#@ output = echo-2.out
#@ error = echo-2.err
#@ class = test
#@ environment = COPY_ALL
#@ shell = /afs/ovpit.indiana.edu/@sys/gnu/bin/bash
#@ queue
echo hello world
gustav@sp20:../LoadLeveler 20:26:33 !601 $
This time we tell LoadLeveler that the job description file is a shell script and that it should invoke
/afs/ovpit.indiana.edu/@sys/gnu/bin/bash
to interpret it. The script itself contains just:
echo hello world
and, indeed, when the job completes, you will find a file
echo-2.out
in your working directory with the following words in it:
gustav@sp20:../LoadLeveler 20:26:33 !601 $ cat echo-2.out
hello world
gustav@sp20:../LoadLeveler 20:28:28 !602 $
There is a significant difference between the two runs.
In the first case we run the stand-alone echo binary
from
/afs/ovpit.indiana.edu/@sys/gnu/bin
In the second case the binary is not
echo but bash,
and the bash built-in echo is used to print hello world
on standard output.
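The distinction between a built-in echo and a stand-alone echo binary can be observed from any interactive shell. The sketch below assumes a shell such as bash, in which echo is a built-in, and it uses the standard env utility to force a lookup of a separate echo program on PATH. Which binary env finds depends on the system, so that part is an assumption too.

```shell
# Inside the shell, a bare "echo" is handled by the shell itself.
builtin_out=$(echo hello world)

# env does not know about shell built-ins; it searches PATH and
# starts a separate echo program, much like the first job did.
external_out=$(env echo hello world)

echo "$builtin_out"
echo "$external_out"
```

Both lines read hello world, so for this job the difference does not show in the output, only in which program produced it.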
This job is way too short to be caught on the queue, unless there are no spare slots left and the job has to wait. But assuming that you had a long job, or that the queue was fully occupied, how would you check what is happening to your job?
The command is llq. Here's an example of how it works:
gustav@sp20:../LoadLeveler 20:30:48 !603 $ llq
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
sp01.71.0                rshoward   1/9 08:04   R  50  b            sp01
sp01.82.0                rshoward   1/10 08:27  R  50  b            sp02
sp01.5.22                wfischer   1/2 08:03   R  50  pb           sp06
sp01.5.21                wfischer   1/2 08:03   R  50  pb           sp11
sp02.1974.0              eisenste   1/11 03:57  R  50  b            sp13
sp05.1150.0              kang       1/8 16:06   R  50  b            sp27
libra.1849.0             kapihaka   1/10 16:20  R  50  b            sp28
libra.1838.0             tachim     1/5 16:48   R  50  b            sp32
sp02.2000.0              eisenste   1/12 03:24  R  50  b            sp33
sp01.5.20                wfischer   1/2 08:03   R  50  pb           sp34
sp01.5.19                wfischer   1/2 08:03   R  50  pb           sp35
sp02.1953.0              eisenste   1/8 02:57   R  50  b            sp36
sp01.134.0               tghanty    1/12 20:27  R  50  a            sp40
sp01.128.0               tghanty    1/12 12:56  R  50  a            sp41
sp01.133.0               tghanty    1/12 19:48  R  50  a            sp43
sp02.1955.0              eisenste   1/8 03:01   R  50  b            sp46
sp02.2001.0              eisenste   1/12 03:26  NQ 50  b
sp01.5.23                wfischer   1/2 08:03   NQ 50  pb
sp01.5.24                wfischer   1/2 08:03   NQ 50  pb
sp01.5.25                wfischer   1/2 08:03   NQ 50  pb
sp01.5.26                wfischer   1/2 08:03   NQ 50  pb
sp01.5.27                wfischer   1/2 08:03   NQ 50  pb
sp01.69.0                wfischer   1/8 21:41   NQ 50  pb
sp01.69.1                wfischer   1/8 21:41   NQ 50  pb
sp01.69.2                wfischer   1/8 21:41   NQ 50  pb
sp01.69.3                wfischer   1/8 21:41   NQ 50  pb
sp01.5.0                 wfischer   1/2 08:03   C  50  pb
sp01.5.1                 wfischer   1/2 08:03   C  50  pb
sp01.5.2                 wfischer   1/2 08:03   C  50  pb
sp01.5.3                 wfischer   1/2 08:03   C  50  pb
sp01.5.4                 wfischer   1/2 08:03   C  50  pb
sp01.5.5                 wfischer   1/2 08:03   RM 50  pb
sp01.5.6                 wfischer   1/2 08:03   C  50  pb
sp01.5.7                 wfischer   1/2 08:03   C  50  pb
sp01.5.8                 wfischer   1/2 08:03   C  50  pb
sp01.5.9                 wfischer   1/2 08:03   C  50  pb
sp01.5.10                wfischer   1/2 08:03   C  50  pb
sp01.5.11                wfischer   1/2 08:03   C  50  pb
sp01.5.12                wfischer   1/2 08:03   C  50  pb
sp01.5.13                wfischer   1/2 08:03   C  50  pb
sp01.5.14                wfischer   1/2 08:03   C  50  pb
sp01.5.15                wfischer   1/2 08:03   C  50  pb
sp01.5.16                wfischer   1/2 08:03   C  50  pb
sp01.5.17                wfischer   1/2 08:03   C  50  pb
sp01.5.18                wfischer   1/2 08:03   C  50  pb

26 jobs in queue 0 waiting, 0 pending, 16 running, 10 held.
gustav@sp20:../LoadLeveler 20:30:58 !604 $
This listing tells us a lot of things. For example, that Mr Will Fischer is hogging the system and that Mary Papakhian should ever so gently infuse some sanity into him. It also tells us that there is little point submitting any jobs to the
pb class, because the queue is
clogged with Mr Fischer's jobs.
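Who is hogging the system can also be worked out mechanically by tallying the rows of such a listing per owner. The following is a sketch under a few assumptions: the sample holds only a handful of rows copied from the listing in this section, the text would in practice come from llq itself, and job rows are recognised by a job ID that ends in a dot followed by digits, which skips the header and separator lines.

```shell
# A few rows copied from the llq listing, header included.
listing='Id Owner Submitted ST PRI Class Running On
------------------------ ---------- ----------- -- --- ------------ -----------
sp01.71.0 rshoward 1/9 08:04 R 50 b sp01
sp01.5.22 wfischer 1/2 08:03 R 50 pb sp06
sp01.5.21 wfischer 1/2 08:03 R 50 pb sp11
sp02.1974.0 eisenste 1/11 03:57 R 50 b sp13
sp01.5.23 wfischer 1/2 08:03 NQ 50 pb'

# Count rows per owner (second column), busiest owner first.
per_owner=$(printf '%s\n' "$listing" |
    awk '$1 ~ /\.[0-9]+$/ { n[$2]++ } END { for (o in n) print n[o], o }' |
    sort -rn)

# The first line names the owner with the most jobs in the sample.
top=$(printf '%s\n' "$per_owner" | head -n 1)
echo "$top"
```

For this sample the top line is 3 wfischer.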
The command llq supports various options. For example,
to list only jobs in class b type
gustav@sp20:../LoadLeveler 20:40:37 !614 $ llq -c b
Id                       Owner      Submitted   ST PRI Class        Running On
------------------------ ---------- ----------- -- --- ------------ -----------
sp01.71.0                rshoward   1/9 08:04   R  50  b            sp01
sp01.82.0                rshoward   1/10 08:27  R  50  b            sp02
sp02.1974.0              eisenste   1/11 03:57  R  50  b            sp13
sp05.1150.0              kang       1/8 16:06   R  50  b            sp27
libra.1849.0             kapihaka   1/10 16:20  R  50  b            sp28
libra.1838.0             tachim     1/5 16:48   R  50  b            sp32
sp02.2000.0              eisenste   1/12 03:24  R  50  b            sp33
sp02.1953.0              eisenste   1/8 02:57   R  50  b            sp36
sp02.1955.0              eisenste   1/8 03:01   R  50  b            sp46
sp02.2001.0              eisenste   1/12 03:26  NQ 50  b

10 jobs in queue 0 waiting, 0 pending, 9 running, 1 held.
gustav@sp20:../LoadLeveler 20:40:59 !615 $
The listing tells us who submitted the jobs and when, what the jobs' priority is, which nodes they run on once they finally start, and which class they have been submitted to. It also shows the job ID and the job's current status. The status is one of the following:
How does one put a hold on a job? One issues the command llhold giving
it the job ID as an argument:
gustav@sp20:../LoadLeveler 20:52:18 !626 $ llhold sp01.5.23
hold: Hold command has been sent to central manager on "sp01.ucs.indiana.edu"
gustav@sp20:../LoadLeveler 20:52:34 !627 $
It's not going to work here, because it's not my job.
To release a job from hold type:
gustav@sp20:../LoadLeveler 20:54:38 !635 $ llhold -r sp01.5.23
hold: Hold command has been sent to central manager on "sp01.ucs.indiana.edu"
gustav@sp20:../LoadLeveler 20:54:54 !636 $
It may happen that you want to cancel the job altogether. The command
to do that is llcancel, e.g.,
gustav@sp20:../LoadLeveler 20:55:42 !639 $ llcancel sp01.5.20
llcancel: Cancel command has been sent to central manager on "sp01.ucs.indiana.edu"
gustav@sp20:../LoadLeveler 20:56:23 !640 $
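Since holding or cancelling somebody else's job is not going to work, it may help to pick out your own job IDs from a listing first. The function below is a sketch only: jobs_of is a made-up name, the rows are copied from the llq listing in this section, and the owner is passed in by hand rather than taken from the login name.

```shell
# Sample rows copied from the llq listing in this section.
listing='sp01.82.0 rshoward 1/10 08:27 R 50 b sp02
sp02.1974.0 eisenste 1/11 03:57 R 50 b sp13
sp01.5.20 wfischer 1/2 08:03 R 50 pb sp34
sp02.2001.0 eisenste 1/12 03:26 NQ 50 b'

# jobs_of OWNER: read llq-style rows on standard input and print
# the job IDs (first column) whose owner (second column) is OWNER.
jobs_of() {
    awk -v owner="$1" '$2 == owner { print $1 }'
}

ids=$(printf '%s\n' "$listing" | jobs_of eisenste)
echo "$ids"
```

Each ID printed this way could then be given to llhold or llcancel in the same way as in the sessions above.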