
Submitting, Inspecting and Cancelling PBS Jobs

In this section we are going to submit our first PBS job.

A PBS job is simply a shell script, possibly with some PBS directives. To the shell, PBS directives look like comments, so the shell ignores them, but PBS picks them up and processes the job accordingly.
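To make the comment trick concrete, here is a minimal sketch. The directives -N (job name) and -l walltime (time limit) are standard PBS directives, but the values used here, the echoed message and the helper name write_demo_job are assumptions made up for the illustration. When the script is run directly under bash, the #PBS lines are treated as ordinary comments:

```shell
#!/bin/bash
# write_demo_job FILE: write a small job script with two PBS directives.
# The job name "hello" and the 5-minute limit are illustrative assumptions.
write_demo_job() {
    cat > "$1" <<'EOF'
#!/bin/bash
#PBS -N hello
#PBS -l walltime=00:05:00
echo "directives ignored by the shell"
exit 0
EOF
}

# Bash sees the #PBS lines as comments, so running the script
# directly simply executes the echo.
job=$(mktemp)
write_demo_job "$job"
bash "$job"
rm -f "$job"
```

Submitted with qsub, the same file would be read by PBS first, and the two directives would then take effect.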

We are going to start by submitting a very simple shell script, which doesn't have any PBS directives.

[gustav@bh1 PBS]$ pwd
/N/B/gustav/PBS
[gustav@bh1 PBS]$ ls
job.sh
[gustav@bh1 PBS]$ cat job.sh
#!/bin/bash
hostname
date
exit 0
[gustav@bh1 PBS]$

The script is submitted with the command qsub:

[gustav@bh1 PBS]$ qsub job.sh
12248.bh1.avidd.iu.edu
[gustav@bh1 PBS]$
The command returns the job id. This is the id that appears, with the domain name stripped off, in the first column of the qstat listing. PBS returns the standard output produced by the job in a file in the same directory from which the job was submitted. Standard error is returned in another file in the same directory:
[gustav@bh1 PBS]$ ls
job.sh  job.sh.e12248  job.sh.o12248
[gustav@bh1 PBS]$
The name of the output file is produced by appending ".o" followed by the job id number to the name of the script submitted to qsub. The name of the error file is produced by appending ".e" followed by the job id number to the name of the script. In this case the error file is empty and the standard output file contains:
[gustav@bh1 PBS]$ cat job.sh.o12248
bc89
Sun Sep  7 16:27:25 EST 2003
[gustav@bh1 PBS]$
and this tells us that the job was executed on bc89.avidd.iu.edu.
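The naming rule described above can be sketched in a few lines of shell. The function name output_names is hypothetical, and the sketch covers only the default naming seen in this section:

```shell
#!/bin/bash
# output_names SCRIPT JOBID: print the names of the standard output and
# standard error files derived from the script name and the numeric part
# of the job id.
output_names() {
    local script=$1
    local number=${2%%.*}      # 12248.bh1.avidd.iu.edu -> 12248
    echo "$script.o$number"
    echo "$script.e$number"
}

output_names job.sh 12248.bh1.avidd.iu.edu
# prints:
# job.sh.o12248
# job.sh.e12248
```

The expansion ${2%%.*} removes everything from the first dot onwards, which is how we get from the full job id to the number used in the file names.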

Now let me show you how you can submit a very large number of jobs automatically. We begin by executing the following simple multiline shell expression:

[gustav@bh1 PBS]$ i=10; while [ $i -gt 0 ]
> do
>    echo $i
>    i=`expr $i - 1`
> done
10
9
8
7
6
5
4
3
2
1
[gustav@bh1 PBS]$
The command works as follows. First we initialize i to 10. Then we start the while loop, which checks whether the value of i, i.e., $i, is still greater than zero. Within the body of the loop we print the value of i and then we decrement it by 1. The loop stops when the value of i becomes zero.
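The same loop can also be written with the shell's built-in arithmetic instead of the external expr program. This is only a sketch of an equivalent form, and the function name countdown is made up for the example:

```shell
#!/bin/bash
# countdown N: print N, N-1, ..., 1, like the loop above,
# but decrementing with $(( )) rather than with expr.
countdown() {
    local i=$1
    while [ "$i" -gt 0 ]; do
        echo "$i"
        i=$((i - 1))
    done
}

countdown 10
```

Note that the comparison has to be written with -gt. Inside [ ] a bare > is an output redirection, so [ $i > 0 ] would create a file called 0 instead of comparing numbers.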

Now we are going to repeat the same multiline command, but this time we are going to insert qsub job.sh within the body of the loop:

[gustav@bh1 PBS]$ i=10; while [ $i -gt 0 ]
> do
>    echo $i
>    qsub job.sh
>    i=`expr $i - 1`
> done
10
12249.bh1.avidd.iu.edu
9
12250.bh1.avidd.iu.edu
8
12251.bh1.avidd.iu.edu
7
12252.bh1.avidd.iu.edu
6
12253.bh1.avidd.iu.edu
5
12254.bh1.avidd.iu.edu
4
12255.bh1.avidd.iu.edu
3
12256.bh1.avidd.iu.edu
2
12257.bh1.avidd.iu.edu
1
12258.bh1.avidd.iu.edu
[gustav@bh1 PBS]$
These jobs are very small and they should execute very quickly:
[gustav@bh1 PBS]$ ls
0              job.sh.e12251  job.sh.e12256  job.sh.o12250  job.sh.o12255
job.sh         job.sh.e12252  job.sh.e12257  job.sh.o12251  job.sh.o12256
job.sh.e12248  job.sh.e12253  job.sh.e12258  job.sh.o12252  job.sh.o12257
job.sh.e12249  job.sh.e12254  job.sh.o12248  job.sh.o12253  job.sh.o12258
job.sh.e12250  job.sh.e12255  job.sh.o12249  job.sh.o12254
[gustav@bh1 PBS]$
Let us see what we can find in the output files:
[gustav@bh1 PBS]$ cat job.sh.o*
bc89
Sun Sep  7 16:27:25 EST 2003
bc89
Sun Sep  7 16:43:18 EST 2003
bc68
Sun Sep  7 16:43:20 EST 2003
bc67
Sun Sep  7 16:43:20 EST 2003
bc67
Sun Sep  7 16:43:21 EST 2003
bc66
Sun Sep  7 16:43:22 EST 2003
bc65
Sun Sep  7 16:43:23 EST 2003
bc65
Sun Sep  7 16:43:24 EST 2003
bc63
Sun Sep  7 16:43:25 EST 2003
bc61
Sun Sep  7 16:43:26 EST 2003
bc61
Sun Sep  7 16:43:27 EST 2003
[gustav@bh1 PBS]$
We can see that the jobs were sent to various nodes, depending on which were available.

You can easily submit thousands of jobs this way. But it is better to write a job-submitting script first and test it with the qsub line commented out, just to make sure that your counting and, most importantly, stopping are going to work as you expect. Only then uncomment the qsub line and run the script again.
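One way to follow this advice without editing the script between the test and the real run is a dry-run switch. The sketch below is an assumption about how such a script might look. The names submit_many and DRY_RUN are hypothetical, and qsub is called only when DRY_RUN is set to 0:

```shell
#!/bin/bash
# submit_many COUNT SCRIPT: submit SCRIPT COUNT times.
# With DRY_RUN=1 (the default here) the qsub command is only printed.
submit_many() {
    local count=$1 script=$2 i=1
    while [ "$i" -le "$count" ]; do
        if [ "${DRY_RUN:-1}" = 1 ]; then
            echo "would run: qsub $script"
        else
            qsub "$script"
        fi
        i=$((i + 1))
    done
}

submit_many 3 job.sh               # prints three "would run" lines
# DRY_RUN=0 submit_many 3 job.sh   # really submits, once the count looks right
```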

Our job executes so fast that we can hardly catch it in action. We are going to slow it down by letting it sleep for a hundred seconds before exiting.

Here is our modified version.

[gustav@bh1 PBS]$ cat job.sh
#!/bin/bash
hostname
date
sleep 100
date
exit 0
[gustav@bh1 PBS]$
And just to make sure that it's not going to hang forever, we are going to execute it interactively and check that it sleeps for 100 seconds only:
[gustav@bh1 PBS]$ time ./job.sh
bh1
Sun Sep  7 16:53:41 EST 2003
Sun Sep  7 16:55:21 EST 2003

real    1m40.029s
user    0m0.000s
sys     0m0.010s
[gustav@bh1 PBS]$
This works just fine: the job took 1 minute and 40 seconds, which is 100 seconds, to execute. Now we are going to submit it with qsub and we are going to look at it with qstat.
[gustav@bh1 PBS]$ qsub job.sh
12259.bh1.avidd.iu.edu
[gustav@bh1 PBS]$ qstat 12259.bh1
Job id           Name             User             Time Use S Queue
---------------- ---------------- ---------------- -------- - -----
12259.bh1        job.sh           gustav                  0 R bg              
[gustav@bh1 PBS]$
You can look at just the job that is of interest to you by giving its ID as an argument to qstat. The "R" in the 5th column indicates that the job is running. The 5th column, labelled S, is the status column. The values you can see there are:
E
the job is exiting after having run
H
the job is held - this means that it is not going to run until it is released
Q
the job is queued and will run when the resources become available
R
the job is running
T
the job is being transferred to a new location - this may happen, e.g., if the node the job had been running on crashed
W
the job is waiting for its scheduled start time - you can submit jobs to run, e.g., after 5PM
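The status column is easy to pick out in a script. The sketch below assumes the default listing layout shown above, with two header lines and the status in the 5th whitespace-separated field. The function names job_status and status_description are hypothetical:

```shell
#!/bin/bash
# job_status: read a qstat listing on standard input and print the status
# letter (5th field) of every job, skipping the two header lines.
job_status() {
    awk 'NR > 2 { print $5 }'
}

# status_description LETTER: describe a status letter, following the list above.
status_description() {
    case $1 in
        E) echo "the job is exiting after having run" ;;
        H) echo "the job is held" ;;
        Q) echo "the job is queued" ;;
        R) echo "the job is running" ;;
        T) echo "the job is being transferred" ;;
        W) echo "the job is waiting" ;;
        *) echo "unknown status: $1" ;;
    esac
}

# Sample listing copied from the qstat output shown earlier.
letter=$(printf '%s\n' \
    'Job id           Name             User             Time Use S Queue' \
    '---------------- ---------------- ---------------- -------- - -----' \
    '12259.bh1        job.sh           gustav                  0 R bg' |
    job_status)
echo "$letter: $(status_description "$letter")"
# prints: R: the job is running
```

On a live system the listing would come from qstat itself, e.g., qstat 12259.bh1 | job_status.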

Now, suppose that we have submitted a job and it is waiting in the queue. Its status is going to be Q. Then we discover that there may be a problem with the job, but we aren't sure. What are we to do? Well, we can "place a hold" on the job by issuing the command qhold:

[gustav@bh1 PBS]$ qsub job.sh
12259.bh1.avidd.iu.edu
[gustav@bh1 PBS]$ qhold 12259.bh1
If the job is still in the queue, it'll be put on hold. Its status will change from Q to H. You can now check if the submitted program and data are OK, and if they are, you can release the hold by calling qrls:
[gustav@bh1 PBS]$ qrls 12259.bh1
This will make the job eligible to run again. Its status will change from H back to Q.

Another scenario arises when the job that you have some doubts about is already running. You can send signals to jobs running under PBS by calling the command qsig. The synopsis of qsig is

qsig -s signal job_id
The signal can be given either as a number, e.g., 2, or as a signal name, e.g., SIGINT. Numbers and names of signals are described in section 7 of the Linux manual. You can read the corresponding manual entry by issuing the command:
[gustav@bh1 gustav]$ man 7 signal
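If you only need to translate a signal number into its name, the shell can do it without the manual page. The following sketch relies on the kill -l builtin, and the function name signal_name is made up for the example:

```shell
#!/bin/bash
# signal_name NUMBER: print the symbolic name for a signal number.
# "kill -l NUMBER" prints the name without the SIG prefix, so we add it.
signal_name() {
    echo "SIG$(kill -l "$1")"
}

signal_name 2     # SIGINT on Linux
signal_name 9     # SIGKILL on Linux
```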

In the following example, we are going to submit our job that sleeps for 100 seconds and then we are going to send an interrupt signal to it (signal number 2, called SIGINT). This is the signal generated by pressing control-C on a normally configured Linux keyboard.

[gustav@bh1 PBS]$ qsub job.sh
12351.bh1.avidd.iu.edu
[gustav@bh1 PBS]$ qsig -s SIGINT 12351.bh1
[gustav@bh1 PBS]$ ls
job.sh  job.sh.e12351  job.sh.o12351
[gustav@bh1 PBS]$ cat job.sh.o12351
bc67
Sun Sep  7 17:19:35 EST 2003
[gustav@bh1 PBS]$ qstat 12351.bh1
qstat: Unknown Job Id 12351.bh1.avidd.iu.edu
[gustav@bh1 PBS]$
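The job has indeed been killed: the output file contains only the first date, so the script never reached the second date command, and qstat no longer knows the job.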

It is usually better to send a signal other than SIGKILL (signal number 9), because SIGKILL cannot be caught, and normally you want the job to catch the signal and act on it, e.g., clean up before exiting.
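Here is a sketch of what catching a signal might look like in a job script written in bash. It is an illustration under assumptions rather than a recipe: the scratch directory stands in for whatever the job would really need to clean up, the function name run_job is hypothetical, and how a signal sent with qsig reaches the script's processes may depend on the PBS installation.

```shell
#!/bin/bash
# run_job SECONDS: a job body that removes its scratch directory whether
# it finishes normally or is interrupted by SIGINT or SIGTERM.
# The body runs in a subshell, so the trap does not leak to the caller.
run_job() (
    scratch=$(mktemp -d)       # hypothetical working directory
    trap 'kill "$pid" 2>/dev/null; rm -rf "$scratch";
          echo "interrupted, cleaned up"; exit 1' INT TERM

    hostname
    date
    sleep "$1" &               # sleep in the background ...
    pid=$!
    wait "$pid"                # ... so that a signal interrupts the wait at once
    date

    rm -rf "$scratch"
    echo "finished normally"
)

# In job.sh the last line would then be:
#   run_job 100
```

The sleep is started in the background and waited for because bash runs a trap handler only after the current foreground command has finished. With a plain sleep 100 the cleanup would be delayed until the sleep ended.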

A more ruthless way to get rid of an unwanted job is to run qdel on it. This command deletes a job from PBS. If the job is running, the command kills it: PBS normally sends SIGTERM first and follows it with SIGKILL after a short delay. If the job is merely queued, the command deletes it from the queue. Here's the example:

[gustav@bh1 PBS]$ qsub job.sh
12390.bh1.avidd.iu.edu
[gustav@bh1 PBS]$ qdel 12390.bh1
[gustav@bh1 PBS]$ ls
job.sh  job.sh.e12390  job.sh.o12390
[gustav@bh1 PBS]$ cat job.sh.o12390
bc89
Sun Sep  7 17:26:01 EST 2003
[gustav@bh1 PBS]$

PBS commands covered in this section
qsub
submit a job for execution
qstat
examine the status of a job (we have discussed what this status may be)
qhold
put a job on hold
qrls
release a hold on a job
qsig
send a signal to a job
qdel
delete a job


Zdzislaw Meglicki
2004-04-29