In this section we are going to submit our first PBS job.
A PBS job is simply a shell script, possibly with some PBS directives. To the shell, PBS directives look like comments, so it ignores them; PBS, however, picks them up and processes the job accordingly.
We are going to start by submitting a very simple shell script, which doesn't have any PBS directives.
[gustav@bh1 PBS]$ pwd
/N/B/gustav/PBS
[gustav@bh1 PBS]$ ls
job.sh
[gustav@bh1 PBS]$ cat job.sh
#!/bin/bash
hostname
date
exit 0
[gustav@bh1 PBS]$
The script is submitted with the command qsub:
[gustav@bh1 PBS]$ qsub job.sh
12248.bh1.avidd.iu.edu
[gustav@bh1 PBS]$
The command returns the job id. This is the id that appears, with the domain name stripped off, in the first column of the qstat listing. PBS writes the standard output produced by the job to a file in the directory from which the job was submitted. Standard error is written to another file in the same directory:
[gustav@bh1 PBS]$ ls
job.sh  job.sh.e12248  job.sh.o12248
[gustav@bh1 PBS]$
The name of the output file is produced by appending ``.o'' followed by the job id number to the name of the script submitted to qsub. The name of the error file is produced by appending ``.e'' followed by the job id number to the name of the script. In this case the error file is empty and the standard output file contains:
[gustav@bh1 PBS]$ cat job.sh.o12248
bc89
Sun Sep 7 16:27:25 EST 2003
[gustav@bh1 PBS]$
This tells us that the job was executed on bc89.avidd.iu.edu.
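The naming rule just described can be sketched as a small shell function. The function name pbs_output_files is made up for illustration; it only applies the rule stated above to a script name and the id printed by qsub, and it does not talk to PBS at all.

```shell
#!/bin/bash
# pbs_output_files is a hypothetical helper, not a PBS command.
# Given a script name and the job id printed by qsub, it prints the
# names of the output and error files according to the rule above.
pbs_output_files() {
    script=$1
    job_id=$2
    # Keep only the number before the first dot:
    # 12248.bh1.avidd.iu.edu -> 12248
    number=${job_id%%.*}
    echo "$script.o$number"
    echo "$script.e$number"
}

pbs_output_files job.sh 12248.bh1.avidd.iu.edu
```

For the job submitted above this prints job.sh.o12248 and job.sh.e12248, the two files shown in the listing.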
Now let me show you how you can submit a very large number of jobs automatically. We begin by executing the following simple multiline shell expression:
[gustav@bh1 PBS]$ i=10; while [ $i -gt 0 ]
> do
> echo $i
> i=`expr $i - 1`
> done
10
9
8
7
6
5
4
3
2
1
[gustav@bh1 PBS]$
The expression works as follows. First we initialize i to 10. Then we start the while loop, which checks whether the value of i, i.e., $i, is still greater than zero. Within the body of the loop we print the value of i and then decrement it by 1. The loop stops when the value of i becomes zero.
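The same countdown can also be written with the shell's built-in arithmetic expansion instead of calling expr. This is only an equivalent sketch of the loop above; the function name countdown is invented here so that the loop can be reused with different starting values.

```shell
#!/bin/bash
# countdown is a hypothetical wrapper around the loop shown above.
countdown() {
    i=$1
    while [ "$i" -gt 0 ]; do
        echo "$i"
        # $(( ... )) does the subtraction in the shell itself,
        # so no external expr process is started.
        i=$((i - 1))
    done
}

countdown 10
```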
Now we are going to repeat the same multiline command, but this time
we are going to insert qsub job.sh within the body of the loop:
[gustav@bh1 PBS]$ i=10; while [ $i -gt 0 ]
> do
> echo $i
> qsub job.sh
> i=`expr $i - 1`
> done
10
12249.bh1.avidd.iu.edu
9
12250.bh1.avidd.iu.edu
8
12251.bh1.avidd.iu.edu
7
12252.bh1.avidd.iu.edu
6
12253.bh1.avidd.iu.edu
5
12254.bh1.avidd.iu.edu
4
12255.bh1.avidd.iu.edu
3
12256.bh1.avidd.iu.edu
2
12257.bh1.avidd.iu.edu
1
12258.bh1.avidd.iu.edu
[gustav@bh1 PBS]$
These jobs are very small and they should execute very quickly:
[gustav@bh1 PBS]$ ls
0              job.sh.e12251  job.sh.e12256  job.sh.o12250  job.sh.o12255
job.sh         job.sh.e12252  job.sh.e12257  job.sh.o12251  job.sh.o12256
job.sh.e12248  job.sh.e12253  job.sh.e12258  job.sh.o12252  job.sh.o12257
job.sh.e12249  job.sh.e12254  job.sh.o12248  job.sh.o12253  job.sh.o12258
job.sh.e12250  job.sh.e12255  job.sh.o12249  job.sh.o12254
[gustav@bh1 PBS]$
Let us see what we can find in the output files:
[gustav@bh1 PBS]$ cat job.sh.o*
bc89
Sun Sep 7 16:27:25 EST 2003
bc89
Sun Sep 7 16:43:18 EST 2003
bc68
Sun Sep 7 16:43:20 EST 2003
bc67
Sun Sep 7 16:43:20 EST 2003
bc67
Sun Sep 7 16:43:21 EST 2003
bc66
Sun Sep 7 16:43:22 EST 2003
bc65
Sun Sep 7 16:43:23 EST 2003
bc65
Sun Sep 7 16:43:24 EST 2003
bc63
Sun Sep 7 16:43:25 EST 2003
bc61
Sun Sep 7 16:43:26 EST 2003
bc61
Sun Sep 7 16:43:27 EST 2003
[gustav@bh1 PBS]$
We can see that the jobs were sent to various nodes, depending on which were available.
You can easily submit thousands of jobs this way. But it is better to write a job submission script first and test it with the qsub line commented out, to make sure that your counting and, most importantly, your stopping condition work as you expect. Only then uncomment the qsub line and run the script again.
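One way to follow this advice without editing the script between runs is a dry-run switch. The sketch below is an assumption-laden illustration, not part of PBS: the function name submit_jobs and the variable DRY_RUN are made up, and the script only prints what it would do unless DRY_RUN is explicitly set to 0.

```shell
#!/bin/bash
# submit_jobs and DRY_RUN are hypothetical names used for this sketch.
# With DRY_RUN unset or set to 1, nothing is submitted; the loop only
# shows how many times qsub would be called.
submit_jobs() {
    i=$1
    while [ "$i" -gt 0 ]; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "would run: qsub job.sh"
        else
            qsub job.sh
        fi
        i=$((i - 1))
    done
}

# Dry run by default: prints three lines and submits nothing.
submit_jobs 3
```

Once the count and the stopping condition look right, the real submission would be started with DRY_RUN=0 in the environment.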
Our job executes so fast that we can hardly catch it in action. We are going to slow it down by letting it sleep for a hundred seconds before exiting.
Here is our modified version.
[gustav@bh1 PBS]$ cat job.sh
#!/bin/bash
hostname
date
sleep 100
date
exit 0
[gustav@bh1 PBS]$
And just to make sure that it's not going to hang forever, we are going to execute it interactively and check that it sleeps for 100 seconds only:
[gustav@bh1 PBS]$ time ./job.sh
bh1
Sun Sep 7 16:53:41 EST 2003
Sun Sep 7 16:55:21 EST 2003
real    1m40.029s
user    0m0.000s
sys     0m0.010s
[gustav@bh1 PBS]$
This works just fine: the job took 1 minute and 40 seconds, which is 100 seconds, to execute. Now we are going to submit it with qsub and look at it with qstat.
[gustav@bh1 PBS]$ qsub job.sh
12259.bh1.avidd.iu.edu
[gustav@bh1 PBS]$ qstat 12259.bh1
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
12259.bh1        job.sh           gustav                  0 R bg
[gustav@bh1 PBS]$
You can look at just the job that is of interest to you by giving its ID as an argument to qstat. The ``R'' in the 5th column indicates that the job is running. The 5th column is the status column; other values you can see there include Q (queued) and H (held), which are discussed next.
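If you want to use the status letter in a script, it can be picked out of the qstat listing with awk. The sketch below assumes the six-column layout shown above, with no spaces inside any field; the function name job_status is made up for illustration. It reads the listing on standard input, so here it is fed a copy of the listing above instead of a live qstat call.

```shell
#!/bin/bash
# job_status is a hypothetical helper, not a PBS command.
# It reads a qstat-style listing on standard input and prints the
# 5th field (the status letter) of the line whose first field
# matches the given job id.
job_status() {
    awk -v id="$1" '$1 == id { print $5 }'
}

# Fed with the listing from the example above; prints R.
printf '%s\n' \
    'Job id           Name             User             Time Use S Queue' \
    '---------------- ---------------- ---------------- -------- - -----' \
    '12259.bh1        job.sh           gustav                  0 R bg' |
    job_status 12259.bh1
```

On a system with PBS installed the same function could be used as qstat 12259.bh1 | job_status 12259.bh1.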
Now, suppose that we have submitted a job and it is waiting
in the queue. Its status is going to be Q. Then we discover
that there may be a problem with the job, but we aren't sure.
What are we to do? Well, we can ``place a hold'' on the job by
issuing the command
qhold:
[gustav@bh1 PBS]$ qsub job.sh
12259.bh1.avidd.iu.edu
[gustav@bh1 PBS]$ qhold 12259.bh1
If the job is still in the queue, it will be put on hold. Its status will change from Q to H. You can now check whether the submitted program and data are OK, and if they are, you can release the hold by calling qrls:
[gustav@bh1 PBS]$ qrls 12259.bh1
This will make the job eligible to run again. Its status will change from H back to Q.
Another scenario arises when the job that you have some doubts about is already running. You can send signals to jobs running under PBS by calling the command qsig. The synopsis of qsig is
qsig -s signal job_id
The signal can be given either as a number, e.g., 2, or as a signal name, e.g., SIGINT. Numbers and names of signals are described in section 7 of the Linux manual. You can read the corresponding manual entry by issuing the command:
[gustav@bh1 gustav]$ man 7 signal
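As a quick cross-check alongside the manual page, the shell's own kill command can translate a signal number into its name. The sketch below wraps that in a helper whose name, signal_name, is made up for illustration; the name is printed without the SIG prefix.

```shell
#!/bin/bash
# signal_name is a hypothetical helper, not a PBS command.
# kill -l followed by a number prints the matching signal name.
signal_name() {
    kill -l "$1"
}

signal_name 2    # prints INT, i.e., SIGINT
signal_name 9    # prints KILL, i.e., SIGKILL
```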
In the following example, we are going to submit our job that sleeps for 100 seconds and then send an interrupt signal to it (signal number 2, called SIGINT). This signal is generated by pressing control-C on a normally configured Linux keyboard.
[gustav@bh1 PBS]$ qsub job.sh
12351.bh1.avidd.iu.edu
[gustav@bh1 PBS]$ qsig -s SIGINT 12351.bh1
[gustav@bh1 PBS]$ ls
job.sh  job.sh.e12351  job.sh.o12351
[gustav@bh1 PBS]$ cat job.sh.o12351
bc67
Sun Sep 7 17:19:35 EST 2003
[gustav@bh1 PBS]$ qstat 12351.bh1
qstat: Unknown Job Id 12351.bh1.avidd.iu.edu
[gustav@bh1 PBS]$
The job has indeed been killed: its output contains only the first date, and qstat no longer knows about it.
It is usually better to send signals other than SIGKILL (signal number 9), because SIGKILL cannot be caught, and normally you want to catch a signal and act on it, e.g., clean up the mess before exiting.
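Catching a signal in a job script is done with the shell's trap command. The following is a minimal sketch under stated assumptions, not a recipe endorsed by PBS: the function names cleanup and run_job and the scratch file are invented for illustration, and the job body is the same hostname, date, sleep, date sequence used above. The block only defines the functions; a real job script would end with the call shown in the last comment.

```shell
#!/bin/bash
# Sketch only: a hypothetical variant of job.sh that tidies up when it
# receives SIGINT or SIGTERM. All names below are made up.

cleanup() {
    echo "caught a signal, cleaning up"
    # Stop the background sleep if it is still running.
    [ -n "$child" ] && kill "$child" 2>/dev/null
    rm -f "$scratch"
    exit 1
}

run_job() {
    scratch=$(mktemp)      # stand-in for the job's temporary data
    trap cleanup INT TERM
    hostname
    date
    # bash runs a trap handler only after the current foreground
    # command has finished, but the wait builtin returns as soon as a
    # trapped signal arrives. So we sleep in the background and wait.
    sleep "${1:-100}" &
    child=$!
    wait "$child"
    date
    rm -f "$scratch"
}

# In a real job script the last line would be:
# run_job 100
```

A handler like this cannot be written for SIGKILL, which is the reason for preferring other signals.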
A more ruthless way to get rid of an unwanted job is to run qdel on it. This command deletes a job from PBS. If the job is running, the command sends it SIGTERM followed by SIGKILL. If the job is merely queued, the command deletes it from the queue. Here is an example:
[gustav@bh1 PBS]$ qsub job.sh
12390.bh1.avidd.iu.edu
[gustav@bh1 PBS]$ qdel 12390.bh1
[gustav@bh1 PBS]$ ls
job.sh  job.sh.e12390  job.sh.o12390
[gustav@bh1 PBS]$ cat job.sh.o12390
bc89
Sun Sep 7 17:26:01 EST 2003
[gustav@bh1 PBS]$
PBS commands covered in this section
- qsub: submit a job for execution
- qstat: examine the status of a job (we have discussed what this status may be)
- qhold: put a job on hold
- qrls: release a job
- qsig: send a signal to a job
- qdel: delete a job