
PBS Dependency Lists

In this section we are going to do what we did in section 4.3.4, but this time we will use the PBS facility for defining job dependencies. We will have four scripts as before, but the scripts will be simpler and will not submit other scripts. Instead we are going to tell PBS how our jobs depend on one another, so that PBS waits for the first job to finish before it releases the second job, then waits for the second job to finish before it releases the third, and so on. All jobs will be submitted at the same time from a single shell script.

The four jobs, first_1.sh, second_1.sh, third_1.sh and fourth_1.sh, look the same as the jobs in section 4.3.4, first.sh, second.sh, third.sh and fourth.sh, except that the job submission lines have been commented out. The real trickery is in the shell script that does the submissions. Here is the script:

[gustav@bh1 PBS]$ cat submit_1
#!/bin/bash
FIRST=`qsub first_1.sh`
echo $FIRST
SECOND=`qsub -W depend=afterok:$FIRST second_1.sh`
echo $SECOND
THIRD=`qsub -W depend=afterok:$SECOND third_1.sh`
echo $THIRD
FOURTH=`qsub -W depend=afterok:$THIRD fourth_1.sh`
echo $FOURTH
exit 0
[gustav@bh1 PBS]$
The qsub command returns the job ID, which is normally printed on standard output. Here we capture the output of qsub in the variables FIRST, SECOND, THIRD and FOURTH. The second job is submitted with the option
-W depend=afterok:$FIRST
This means that the second job is put on hold until the first job has completed with no errors. Only then is the second job released. The third and fourth jobs are treated similarly.
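
The same pattern extends to any number of scripts. The following is a minimal sketch, not part of the original submit_1: the function name submit_chain and the QSUB variable are hypothetical, and QSUB exists only so that the function can be dry-run with a stand-in command. It defaults to the real qsub.

```shell
#!/bin/bash
# Sketch: submit the given scripts as a chain, so that each job is
# released only after the previous one has completed without errors.
# QSUB is a hypothetical override hook (not a PBS feature); when it is
# unset, the real qsub is called.
submit_chain() {
    local prev="" id script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # The first job has no predecessor.
            id=$("${QSUB:-qsub}" "$script") || return 1
        else
            # Every later job waits for the job submitted just before it.
            id=$("${QSUB:-qsub}" -W depend=afterok:"$prev" "$script") || return 1
        fi
        echo "$id"
        prev=$id
    done
}
```

Calling submit_chain first_1.sh second_1.sh third_1.sh fourth_1.sh should then behave like submit_1 above, printing one job ID per line. The function stops at the first submission that fails, so no job is submitted with an empty dependency.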

Let us run the script and see what happens:

[gustav@bh1 PBS]$ ./submit_1
13876.bh1.avidd.iu.edu
13877.bh1.avidd.iu.edu
13878.bh1.avidd.iu.edu
13879.bh1.avidd.iu.edu
[gustav@bh1 PBS]$ qstat | grep gustav
13876.bh1        first            gustav                  0 Q bg              
13877.bh1        second           gustav                  0 H bg              
13878.bh1        third            gustav                  0 H bg              
13879.bh1        fourth           gustav                  0 H bg              
[gustav@bh1 PBS]$ qstat -f 13878.bh1
Job Id: 13878.bh1.avidd.iu.edu
    Job_Name = third
    Job_Owner = gustav@bh1.avidd.iu.edu
    job_state = H
    queue = bg
    server = bh1.avidd.iu.edu
    Checkpoint = u
    ctime = Sat Sep 13 14:33:26 2003
    depend = afterok:13877.bh1.avidd.iu.edu@bh1.avidd.iu.edu,
        beforeok:13879.bh1.avidd.iu.edu@bh1.avidd.iu.edu
    Error_Path = bh1.avidd.iu.edu:/N/B/gustav/PBS/third_err
    Hold_Types = s
    Join_Path = oe
    Keep_Files = n
    Mail_Points = a
    mtime = Sat Sep 13 14:33:26 2003
    Output_Path = bh1.avidd.iu.edu:/N/B/gustav/PBS/third_out
    Priority = 0
    qtime = Sat Sep 13 14:33:26 2003
    Rerunable = True
    Resource_List.ncpus = 1
    Resource_List.nodect = 1
    Resource_List.nodes = 1
    Resource_List.walltime = 00:30:00
    Shell_Path_List = /bin/bash
    Variable_List = PBS_O_HOME=/N/B/gustav,PBS_O_LOGNAME=gustav,
        PBS_O_PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/b
        in:/usr/local/gm/bin:/usr/lpp/mmfs/bin:/opt/intel/compiler70/ia32/bin:/
        usr/local/maui/bin:/usr/pbs/bin:/usr/pbs/sbin:/opt/pgi/linux86/bin:/N/h
        pc/totalview/bin:/opt/xcat/bin:/opt/xcat/sbin:/opt/xcat/i686/bin:/opt/x
        cat/i686/sbin:/N/B/gustav/bin,PBS_O_MAIL=/var/spool/mail/gustav,
        PBS_O_SHELL=/bin/bash,PBS_O_HOST=bh1.avidd.iu.edu,
        PBS_O_WORKDIR=/N/B/gustav/PBS,PBS_O_QUEUE=bg

[gustav@bh1 PBS]$
We have generated four jobs, which were all submitted at roughly the same time. But only the first job is queued (state Q), whereas the remaining three jobs are on hold (state H). Requesting the full listing of the third job with qstat -f shows the dependency:
    depend = afterok:13877.bh1.avidd.iu.edu@bh1.avidd.iu.edu,
        beforeok:13879.bh1.avidd.iu.edu@bh1.avidd.iu.edu
The job can be started only after 13877 has completed without errors. Observe that PBS has recorded another dependency, which I have not specified explicitly: once this job, 13878, has completed without errors, job 13879 may be started, i.e., there is another job that depends on this one.
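
The full listing is long, and the depend attribute is wrapped over two lines. Here is a small sketch of a filter that pulls the attribute out and joins it into a single line. The function name job_depend is hypothetical, and the filter assumes that continuation lines never contain the string " = ", which holds for the listing shown above but is not guaranteed in general.

```shell
#!/bin/bash
# Sketch: read `qstat -f` output on standard input and print the value
# of the depend attribute on one line.
# Assumption: a continuation line never contains " = ".
job_depend() {
    awk '
        # While collecting, glue continuation lines onto the value.
        grab && !/ = / { sub(/^[ \t]+/, ""); buf = buf $0; next }
        # The next attribute line ends the value.
        grab           { print buf; grab = 0; exit }
        # Start collecting at the depend attribute.
        /^[ \t]*depend = / { sub(/^[ \t]*depend = /, ""); buf = $0; grab = 1 }
        # Covers the case where depend is the last attribute.
        END { if (grab) print buf }
    '
}
```

It would be used as qstat -f 13878.bh1 | job_depend. A job with no dependencies produces no output.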

The dependency is specified with the -W option to qsub. This option is used for additional job attributes, of which dependency is one. The word depend, which flags this attribute, must be followed by a dependency type and a colon-separated list of the jobs on which the submitted job depends, e.g.,

-W depend=afterok:13876.bh1.avidd.iu.edu:13877.bh1.avidd.iu.edu
Here we state that the job can be released from hold only after the two preceding jobs, 13876.bh1.avidd.iu.edu and 13877.bh1.avidd.iu.edu, have both completed without errors.
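
When the job IDs are held in shell variables, the dependency string can be assembled rather than typed by hand. This is a minimal sketch with a hypothetical helper, depend_list, that joins a dependency type and any number of job IDs with colons:

```shell
#!/bin/bash
# Sketch: build the value of depend= from a dependency type
# followed by one or more job IDs.
depend_list() {
    local kind=$1 id list
    shift
    list=$kind
    for id in "$@"; do
        list="$list:$id"
    done
    echo "$list"
}
```

A job that must wait for both the first and the second job could then be submitted with qsub -W depend="$(depend_list afterok "$FIRST" "$SECOND")" third_1.sh, assuming FIRST and SECOND were captured as in submit_1.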

The jobs get released one after another. This can be seen by running qstat every now and then:

[gustav@bh1 PBS]$ qstat | grep gustav
13878.bh1        third            gustav           00:00:05 R bg              
13879.bh1        fourth           gustav                  0 H bg              
[gustav@bh1 PBS]$
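
Instead of running qstat by hand, the check can be repeated in a loop. The sketch below is an illustration only: the function name wait_for_user_jobs and the 30-second default interval are arbitrary choices, and the match is the same plain grep used above, so it would also trigger on a job name that happens to contain the user name.

```shell
#!/bin/bash
# Sketch: return once qstat no longer lists any line matching the
# given user name. The second argument is the polling interval in
# seconds (default 30, an arbitrary choice).
wait_for_user_jobs() {
    local user=$1 interval=${2:-30}
    while qstat | grep -q "$user"; do
        sleep "$interval"
    done
}
```

For example, wait_for_user_jobs gustav 60 would poll once a minute and return after the last of the four jobs has left the queue.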

Eventually everything completes and we are left with four logs in the PBS directory:

[gustav@bh1 PBS]$ cat *_out
/N/gpfs/gustav prepared and cleaned.
Directory /N/gpfs/gustav cleaned.
writing on test
writing 1000 blocks of 1048576 random integers

real    2m42.813s
user    0m40.040s
sys     0m14.760s
-rw-r--r--    1 gustav   ucs      4194304000 Sep 13 14:44 test
File /N/gpfs/gustav/test generated.
reading test
reading in chunks of size 16777216 bytes
allocated 16777216 bytes to junk
read 4194304000 bytes

real    2m51.521s
user    0m0.000s
sys     0m12.600s
File /N/gpfs/gustav/test processed.
[gustav@bh1 PBS]$
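Observe that, going by the timings above, the IO on writing is about 25 MB/s (4194304000 bytes in 162.8 s) and about 23 MB/s on reading (the same 4194304000 bytes in 171.5 s), taking 1 MB to be 1048576 bytes. This illustrates yet again how much IO can vary depending on the system load and configuration.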
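
Transfer rates can be recomputed from the byte count and the real times reported in the log. The following sketch uses a hypothetical helper, rate_mb_per_s, and assumes that 1 MB means 1048576 bytes:

```shell
#!/bin/bash
# Sketch: transfer rate in MB/s, with 1 MB = 1048576 bytes (an
# assumption), from a byte count and an elapsed time in seconds.
rate_mb_per_s() {
    awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", bytes / 1048576 / secs }'
}

rate_mb_per_s 4194304000 162.813   # writing: real 2m42.813s
rate_mb_per_s 4194304000 171.521   # reading: real 2m51.521s
```

The real time is used rather than the user or sys time, because it is the elapsed wall-clock time that determines how fast the data actually moved.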


Zdzislaw Meglicki
2004-04-29