Before you can begin working with LoadLeveler on the SP, you must first find out how it is configured. And before you can understand how LoadLeveler has been configured on the SP, you must find out how the SP itself is configured. You can do this by typing:
gustav@sp20:../gustav 17:35:47 !509 $ jm_status -P
Pool 0: Batch_only_SP_nodes
Subpool: BATCH
Node: sp01.ucs.indiana.edu
Node: sp02.ucs.indiana.edu
Node: sp03.ucs.indiana.edu
...
Node: sp45.ucs.indiana.edu
Node: sp46.ucs.indiana.edu
Node: sp47.ucs.indiana.edu
gustav@sp20:../gustav 17:35:51 !510 $
This command interrogates the SP Job Manager, or Resource Manager,
as it is also called. The -P option lists pools of processors
configured into the system. In our case there is just one pool which
comprises 47 P2SC nodes. Since every one of those delivers some 700 MFLOPS peak,
you've got nearly 33 GFLOPS of computing power available.
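The estimate above is simple arithmetic. A minimal Python sketch, using only the two figures quoted in the text (47 nodes and roughly 700 MFLOPS peak per node; the function name is just an illustration):

```python
# Figures quoted in the text above; the per-node peak is approximate.
NODES = 47
PEAK_MFLOPS_PER_NODE = 700


def aggregate_peak_gflops(nodes, mflops_per_node):
    """Return the aggregate peak in GFLOPS (1 GFLOPS = 1000 MFLOPS)."""
    return nodes * mflops_per_node / 1000


print(aggregate_peak_gflops(NODES, PEAK_MFLOPS_PER_NODE))  # 32.9
```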
Now, once you know what's out there, you can ask LoadLeveler how those resources can be accessed. The command that will tell you that is
gustav@sp20:../gustav 17:36:28 !511 $ llclass
Name MaxJobCPU MaxProcCPU Free Max Description
d+hh:mm:ss d+hh:mm:ss Slots Slots
b -1 -1 15 24 long serial jobs
l -1 -1 5 5 large-memory serial jobs
qcd -1 -1 1 1 Quantum Chemistry Division
test 0+00:05:00 0+00:05:00 8 8 5-minute test jobs
q 0+01:00:00 0+01:00:00 2 2 quick serial jobs
a 1+00:00:00 1+00:00:00 4 6 short serial jobs
stat 1+12:00:00 1+12:00:00 3 3 statistics jobs
pa 1+12:00:00 1+12:00:00 12 12 short parallel jobs
math 1+12:00:00 1+12:00:00 3 3 mathematics jobs
pb -1 -1 0 32 long parallel jobs
gustav@sp20:../gustav 17:54:17 !512 $
This time LoadLeveler tells us that we have 10 classes. LoadLeveler classes correspond
closely to queues in systems such as NQS and, indeed, there is a queue
associated with every class.
Class pb has up to 32 slots, of which, according to the listing,
none are available at present. Those 32 slots are 32 job instances.
That is, the class allows you to run either up to 32 serial jobs,
or, say, 2 parallel jobs, each running on 16 processors.
Classes b, l, qcd, and pb are CPU-time unlimited.
This means that you can submit, for example, a 32-way parallel job to
class pb that may run forever. This may be rather antisocial,
but LoadLeveler configuration allows you to do just that.
Class test is for test runs only, i.e., for very short jobs, just long
enough to check that your program has been correctly linked and that it runs.
Then we have classes q through math, which are for jobs that take
between 1 hour and 1.5 days of CPU time.
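The limits in the llclass listing use the format d+hh:mm:ss, with -1 standing for "unlimited". As a rough sketch (the helper name parse_limit is hypothetical, and the table is typed in by hand from the listing above rather than read from LoadLeveler), the limits can be converted to seconds and the unlimited classes picked out like this:

```python
def parse_limit(text):
    """Convert an llclass limit such as '1+12:00:00' to seconds.

    Returns None for '-1', which the listing shows for unlimited classes.
    """
    if text == "-1":
        return None
    days, clock = text.split("+")
    hours, minutes, seconds = (int(part) for part in clock.split(":"))
    return ((int(days) * 24 + hours) * 60 + minutes) * 60 + seconds


# MaxJobCPU values copied by hand from the llclass listing above.
MAX_JOB_CPU = {
    "b": "-1", "l": "-1", "qcd": "-1", "test": "0+00:05:00",
    "q": "0+01:00:00", "a": "1+00:00:00", "stat": "1+12:00:00",
    "pa": "1+12:00:00", "math": "1+12:00:00", "pb": "-1",
}

unlimited = [name for name, limit in MAX_JOB_CPU.items()
             if parse_limit(limit) is None]
print(unlimited)  # ['b', 'l', 'qcd', 'pb']

# Print the remaining limits in hours.
for name, limit in MAX_JOB_CPU.items():
    seconds = parse_limit(limit)
    if seconds is not None:
        print(name, seconds / 3600, "hours")
```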
In order to find more information about any particular class, you can
call llclass with the -l switch, e.g.:
gustav@sp20:../gustav 17:54:17 !512 $ llclass -l pa
=============== Class pa ==========
Name: pa
priority: 40
admin:
NQS_class: F
NQS_submit:
NQS_query:
max_processors: 8
maxjobs: -1
class_comment: short parallel jobs
wall_clock_limit: -1, -1
job_cpu_limit: 1+12:00:00, -1
cpu_limit: 1+12:00:00, -1
data_limit: -1, -1
core_limit: -1, -1
file_limit: -1, -1
stack_limit: -1, -1
rss_limit: -1, -1
nice: 0
free: 12
maximum: 12
gustav@sp20:../gustav 18:19:58 !513 $
Here you can see that even though there are 12 slots in this class, the maximum
number of processors you can request is 8. The CPU limit is cumulative,
i.e., if you run a job on 8 CPUs and if they all munch CPU time equally,
the CPU time allowance per processor will be 4 hours and 30 minutes.
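The per-processor figure follows from dividing the cumulative limit by the number of processors. A small sketch of that division, assuming, as the text does, that all processors consume CPU time equally (the function name is illustrative only):

```python
from datetime import timedelta

# job_cpu_limit for class pa, from the llclass -l listing: 1+12:00:00.
JOB_CPU_LIMIT = timedelta(days=1, hours=12)


def per_processor_allowance(job_limit, processors):
    """Share a cumulative CPU-time limit equally among the processors."""
    return job_limit / processors


print(per_processor_allowance(JOB_CPU_LIMIT, 8))  # 4:30:00
```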
If you run llclass -l on the test class, you'll see that
it has a higher priority than the pa class. They both run on
the same processors, actually, so if there are two jobs submitted at
the same time, one to pa and the other to test, it
is the test job that will run first, unless users alter
the priorities of those jobs explicitly. A user can do that, but
user priority usually carries less weight than system priority.
How do you find out which class runs on which nodes? To do that
you can run the command llstatus:
gustav@sp20:../SP 18:32:22 !544 $ llstatus
Name Schedd InQ Act Startd Run LdAvg Idle Arch OpSys
libra.ucs.indiana.edu Avail 2 2 Idle 0 0.09 2112 R6000 AIX43
sp01.ucs.indiana.edu Avail 36 8 Run 1 1.10 1 R6000 AIX43
sp02.ucs.indiana.edu Avail 5 4 Run 1 1.02 7801 R6000 AIX43
sp03.ucs.indiana.edu Avail 0 0 Run 1 1.08 2952 R6000 AIX43
...
sp44.ucs.indiana.edu Avail 0 0 Run 1 1.00 9112 R6000 AIX43
sp45.ucs.indiana.edu Avail 0 0 Run 1 1.00 9999 R6000 AIX43
sp46.ucs.indiana.edu Avail 0 0 Busy 2 2.03 9999 R6000 AIX43
sp47.ucs.indiana.edu Avail 0 0 Run 1 1.04 9999 R6000 AIX43
R6000/AIX43 48 machines 44 jobs 43 running
Total Machines 48 machines 44 jobs 43 running
The Central Manager is defined on sp01.ucs.indiana.edu
All machines on the machine_list are present
gustav@sp20:../SP 18:32:24 !545 $
When called without any options,
llstatus simply lists all machines
under LoadLeveler management. Observe that although our SP has 47
nodes, LoadLeveler manages 48 machines. The 48th machine is libra.ucs.indiana.edu.
The listing tells you whether a machine is busy or idle, its average load,
its architecture and operating system, and whether
the LoadLeveler scheduler runs on that node.
When invoked with the -l option, the command llstatus returns
a very detailed listing for each machine that is managed by LoadLeveler.
If you don't want to look at all nodes, you can select just one by providing
its name on the command line:
gustav@sp20:../SP 18:38:13 !551 $ llstatus -l sp20
name: "sp20.ucs.indiana.edu"
machine_context:
Running = 0
ScheddAvail = 1
StartdAvail = 1
State = Idle
ScheddState = 0
OpSys = AIX43
Arch = R6000
Machine = sp20.ucs.indiana.edu
START = T
SUSPEND = F
CONTINUE = T
VACATE = F
KILL = F
SYSPRIO = ((ClassSysprio * 100) - QDate)
MACHPRIO = (0 - (1000 * (LoadAvg / Speed)))
VirtualMemory = 105392
EnteredCurrentState = Tue Jan 5 12:17:37 1999
Disk = 13072
Tmp = 197736
KeyboardIdle = 42
LoadAvg = 0.000092
AvailableClasses = { "pa" "test" }
DrainingClasses = { }
DrainedClasses = { }
Pool = 0
Adapter = { "ethernet" "hps_user" "hps_ip" }
ConfiguredClasses = { "pa" "test" }
Feature = { "256MB" "afs" }
ProtocolVersion = 1
CkptVersion = 1
Memory = 256
Max_Starters = 2
ConfigTimeStamp = Tue Jan 5 12:16:32 1999
Cpus = 1
Speed = 3.000000
MasterMachPriority = 0.000000
Subnet = 129.79.7
CustomMetric = 1
ScheddRunning = 0
Pending = 0
Starting = 0
Idle = 0
Unexpanded = 0
Held = 0
Removed = 0
RemovePending = 0
Completed = 2
DependantNotRun = 0
TotalJobs = 0
time_stamp: Tue Jan 12 18:37:42 1999
gustav@sp20:../SP 18:38:19 !552 $
There is quite a lot of information in this listing. In particular you'll
see the entry ConfiguredClasses, which in this case is:
{ "pa" "test" }, and this means that when you submit a job to pa
or to test it may end up running on that node. Or on some other node
that has pa or test in its ConfiguredClasses slot.
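If you want to pull the class lists out of such a listing with a script rather than by eye, something along the following lines may do. This is only a sketch: the function name parse_class_set is hypothetical, and SAMPLE holds three lines copied from the llstatus -l sp20 output above instead of live command output.

```python
import re


def parse_class_set(listing, key):
    """Return the quoted names in a line such as: key = { "pa" "test" }."""
    for line in listing.splitlines():
        name, _, value = line.partition("=")
        if name.strip() == key:
            return re.findall(r'"([^"]*)"', value)
    return []


# Three lines copied from the llstatus -l sp20 listing above.
SAMPLE = '''
AvailableClasses = { "pa" "test" }
DrainingClasses = { }
ConfiguredClasses = { "pa" "test" }
'''

print(parse_class_set(SAMPLE, "ConfiguredClasses"))  # ['pa', 'test']
print(parse_class_set(SAMPLE, "DrainingClasses"))    # []
```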
It would be good, however, if we could ask LoadLeveler about a particular
class and then find which nodes it runs on. The command llclass
should do that, but it doesn't. So on our system we have our own
local command, llconfig, which prints
a more palatable summary:
gustav@sp20:../SP 18:54:26 !556 $ llconfig
LoadLeveler Configuration on the SP
Total
Node Job Classes Jobs Features
libra q 2 512MB
sp01 l,b 2 512MB
sp02 l,b 2 512MB
sp03 l,pb 2 512MB
sp04 l,pb 2 512MB
sp05 stat,pb 2 256MB gauss glim lisrel prelis rats sas spss tsp
sp06 stat,pb 2 256MB gauss glim rats sas spss tsp
sp07 stat,pb 2 256MB glim rats sas spss tsp
sp08 b,pb 2 256MB
sp09 math,pb 2 256MB lindo lingo maple math matlab
sp10 math,pb 2 256MB lindo lingo maple matlab
sp11 math,pb 2 256MB lindo lingo maple matlab
sp12 b,pb 2 256MB
sp13 b,pb 2 256MB
sp14 b,pb 2 256MB
sp15 b,pb 2 256MB naglib
sp16 b,pb 2 256MB naglib
sp17 pa,test 2 256MB afs
sp18 pa,test 2 256MB afs
sp19 pa,test 2 256MB afs
sp20 pa,test 2 256MB afs
sp21 pa,test 2 256MB afs
sp22 pa,test 2 256MB afs
sp23 pa,test 2 256MB afs
sp24 pa,test 2 256MB afs
sp25 l,qcd 2 512MB
sp26 b,pb 2 256MB bigscr naglib
sp27 b,pb 2 256MB bigscr naglib
sp28 b,pb 2 256MB bigscr naglib
sp29 b,pb 2 256MB bigscr naglib
sp30 b,pb 2 256MB bigscr naglib
sp31 b,pb 2 256MB bigscr naglib
sp32 b,pb 2 256MB bigscr naglib
sp33 b,pb 2 256MB bigscr naglib
sp34 b,pb 2 256MB bigscr naglib
sp35 b,pb 2 256MB bigscr naglib
sp36 b,pb 2 256MB bigscr naglib
sp37 b,pb 2 256MB bigscr naglib
sp38 b,pb 2 256MB bigscr naglib
sp39 b,pb 2 256MB bigscr naglib
sp40 a,pa 2 256MB afs bigscr
sp41 a,pa 2 256MB afs bigscr
sp42 a,pa 2 256MB afs bigscr
sp43 a,pa 2 256MB afs bigscr
sp44 a,pb 2 256MB bigscr
sp45 a,pb 2 256MB
sp46 b,pb 2 256MB bigscr
sp47 b,pb 2 256MB bigscr
Maximum Processor Limits
class pa 8
class test 8
class pb 32
all other classes 1
Memory
42 nodes have 256MB memory. The 6 nodes with 512MB memory can
be selected by feature code, providing the appropriate class is
also specified.
gustav@sp20:../SP 18:55:16 !557 $
To search for more specific information you can always grep, for example:
gustav@sp20:../SP 18:55:16 !557 $ llconfig | grep pa
sp17 pa,test 2 256MB afs
sp18 pa,test 2 256MB afs
sp19 pa,test 2 256MB afs
sp20 pa,test 2 256MB afs
sp21 pa,test 2 256MB afs
sp22 pa,test 2 256MB afs
sp23 pa,test 2 256MB afs
sp24 pa,test 2 256MB afs
sp40 a,pa 2 256MB afs bigscr
sp41 a,pa 2 256MB afs bigscr
sp42 a,pa 2 256MB afs bigscr
sp43 a,pa 2 256MB afs bigscr
class pa 8
gustav@sp20:../SP 18:56:19 !558 $
And this clearly tells us that class
pa runs on sp17 through sp24 and
then on sp40 through sp43. The listing also tells us that
all those nodes run AFS. They are often used for Computer Science experiments,
and you can expect to find various other goodies installed there soon, e.g.,
DFS, HPSS, and GPFS.
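The same class-to-node lookup can be done for all classes at once by inverting the node table that llconfig prints. The sketch below is an illustration only: llconfig is the local command described above, the function name nodes_by_class is hypothetical, and SAMPLE contains a few rows copied from the listing rather than the full output.

```python
from collections import defaultdict


def nodes_by_class(listing):
    """Map each class name to the list of nodes configured to run it."""
    result = defaultdict(list)
    for line in listing.splitlines():
        fields = line.split()
        # Node rows look like: sp17 pa,test 2 256MB afs
        # Headers and summary lines fail this check and are skipped.
        if len(fields) >= 4 and fields[2].isdigit() and fields[3].endswith("MB"):
            node, classes = fields[0], fields[1].split(",")
            for name in classes:
                result[name].append(node)
    return dict(result)


# A few rows copied from the llconfig listing above.
SAMPLE = """\
Node Job Classes Jobs Features
libra q 2 512MB
sp17 pa,test 2 256MB afs
sp24 pa,test 2 256MB afs
sp40 a,pa 2 256MB afs bigscr
sp44 a,pb 2 256MB bigscr
class pa 8
"""

table = nodes_by_class(SAMPLE)
print(table["pa"])    # ['sp17', 'sp24', 'sp40']
print(table["test"])  # ['sp17', 'sp24']
```

Checking the field layout, rather than matching a substring, keeps the header and the "class pa 8" summary line out of the result.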