In this section we are going to repeat what we did in section 4.3.4, but this time we will use the PBS facility for defining job dependencies. We will have four scripts as before, but the scripts will be simpler and they will not submit other scripts. Instead we are going to tell PBS how our jobs depend on one another, so that PBS waits for the first job to finish before it releases the second job. PBS then waits for the second job to finish before the third job gets released, and so on. All jobs are submitted at the same time from a single shell script.
The four jobs, first_1.sh, second_1.sh,
third_1.sh and fourth_1.sh look the same as
the jobs in section 4.3.4, first.sh,
second.sh, third.sh and fourth.sh,
with the exception that the job submission lines were commented
out. The real trickery is in the shell script that does
the submissions. Here is the script:
[gustav@bh1 PBS]$ cat submit_1
#!/bin/bash
FIRST=`qsub first_1.sh`
echo $FIRST
SECOND=`qsub -W depend=afterok:$FIRST second_1.sh`
echo $SECOND
THIRD=`qsub -W depend=afterok:$SECOND third_1.sh`
echo $THIRD
FOURTH=`qsub -W depend=afterok:$THIRD fourth_1.sh`
echo $FOURTH
exit 0
[gustav@bh1 PBS]$
Command qsub returns the job ID and this is normally printed
on standard output. Here we capture the output of qsub in
variables FIRST, SECOND, THIRD and FOURTH.
The second job is submitted with option
-W depend=afterok:$FIRST
This means that the job itself is going to be put on hold until the first job has completed with no errors. Only then is the second job going to be released. The third and fourth jobs are treated similarly.
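The same chaining can be written as a loop, which may be handy when the number of jobs grows. What follows is only a sketch: the function name submit_chain and the QSUB override variable are made up for this illustration and are not part of PBS. The only PBS features used are the ones shown above, namely that qsub prints the job ID and accepts -W depend=afterok:ID.

```shell
#!/bin/bash
# Sketch: submit the given scripts as a chain, each depending on the previous one.
# QSUB is a hypothetical override hook (not part of PBS) that lets the
# function be tried out without a PBS server; it defaults to the real qsub.
submit_chain() {
    local qsub_cmd=${QSUB:-qsub}
    local prev="" id script
    for script in "$@"; do
        if [ -z "$prev" ]; then
            # The first job has no dependency.
            id=$("$qsub_cmd" "$script") || return 1
        else
            # Every later job is held until the previous one ends without errors.
            id=$("$qsub_cmd" -W depend=afterok:"$prev" "$script") || return 1
        fi
        # An empty job ID means the submission did not work; stop here.
        [ -n "$id" ] || return 1
        echo "$id"
        prev=$id
    done
}
```

It would be called as submit_chain first_1.sh second_1.sh third_1.sh fourth_1.sh. Unlike submit_1, it stops at the first failed submission instead of passing an empty job ID to the next qsub.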
Let us run the script and see what happens:
[gustav@bh1 PBS]$ ./submit_1
13876.bh1.avidd.iu.edu
13877.bh1.avidd.iu.edu
13878.bh1.avidd.iu.edu
13879.bh1.avidd.iu.edu
[gustav@bh1 PBS]$ qstat | grep gustav
13876.bh1 first gustav 0 Q bg
13877.bh1 second gustav 0 H bg
13878.bh1 third gustav 0 H bg
13879.bh1 fourth gustav 0 H bg
[gustav@bh1 PBS]$ qstat -f 13878.bh1
Job Id: 13878.bh1.avidd.iu.edu
Job_Name = third
Job_Owner = gustav@bh1.avidd.iu.edu
job_state = H
queue = bg
server = bh1.avidd.iu.edu
Checkpoint = u
ctime = Sat Sep 13 14:33:26 2003
depend = afterok:13877.bh1.avidd.iu.edu@bh1.avidd.iu.edu,
beforeok:13879.bh1.avidd.iu.edu@bh1.avidd.iu.edu
Error_Path = bh1.avidd.iu.edu:/N/B/gustav/PBS/third_err
Hold_Types = s
Join_Path = oe
Keep_Files = n
Mail_Points = a
mtime = Sat Sep 13 14:33:26 2003
Output_Path = bh1.avidd.iu.edu:/N/B/gustav/PBS/third_out
Priority = 0
qtime = Sat Sep 13 14:33:26 2003
Rerunable = True
Resource_List.ncpus = 1
Resource_List.nodect = 1
Resource_List.nodes = 1
Resource_List.walltime = 00:30:00
Shell_Path_List = /bin/bash
Variable_List = PBS_O_HOME=/N/B/gustav,PBS_O_LOGNAME=gustav,
PBS_O_PATH=/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/b
in:/usr/local/gm/bin:/usr/lpp/mmfs/bin:/opt/intel/compiler70/ia32/bin:/
usr/local/maui/bin:/usr/pbs/bin:/usr/pbs/sbin:/opt/pgi/linux86/bin:/N/h
pc/totalview/bin:/opt/xcat/bin:/opt/xcat/sbin:/opt/xcat/i686/bin:/opt/x
cat/i686/sbin:/N/B/gustav/bin,PBS_O_MAIL=/var/spool/mail/gustav,
PBS_O_SHELL=/bin/bash,PBS_O_HOST=bh1.avidd.iu.edu,
PBS_O_WORKDIR=/N/B/gustav/PBS,PBS_O_QUEUE=bg
[gustav@bh1 PBS]$
We have generated four jobs, which were all submitted at roughly
the same time. But only the first job is queued, whereas the remaining
three jobs are on hold. Requesting the full listing
of the third job
with qstat -f shows the dependency:
depend = afterok:13877.bh1.avidd.iu.edu@bh1.avidd.iu.edu,
beforeok:13879.bh1.avidd.iu.edu@bh1.avidd.iu.edu
The job can be started only after 13877 has completed without
errors. Observe that PBS has recorded another dependency, which
I have not specified explicitly: once this job, 13878,
has completed without errors, job 13879 may be started, i.e.,
there is another job that depends on this one.
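If only the dependency is of interest, it can be filtered out of the full listing. The helper below is a sketch under assumptions about the layout of qstat -f output: that continuation lines of an attribute are indented and do not themselves contain " = ". The name job_depend is made up for this illustration.

```shell
#!/bin/bash
# Sketch: read "qstat -f" output on standard input and print the value of
# the depend attribute on one line, joining its continuation lines.
# Assumes continuation lines are indented and contain no " = ".
job_depend() {
    awk '
        /^[ \t]*depend = / {
            sub(/^[ \t]*depend = /, "")
            val = $0
            grab = 1
            next
        }
        grab && /^[ \t]/ && !/ = / {
            sub(/^[ \t]+/, "")
            val = val $0
            next
        }
        grab { grab = 0 }
        END { print val }
    '
}
```

It would be used as qstat -f 13878.bh1 | job_depend.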
The dependency is specified by using the -W option to
qsub. The option is generally used for additional
attributes, of which dependency is one. The word
depend that flags this attribute must be followed by
a list of jobs on which the submitted job depends,
qualified with types of dependencies, e.g.,
-W depend=afterok:13876.bh1.avidd.iu.edu:13877.bh1.avidd.iu.edu
Here we state that the job can be released from hold only after two preceding jobs,
13876.bh1.avidd.iu.edu and
13877.bh1.avidd.iu.edu, have completed their run without errors.
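When a job depends on several others, the attribute string can be assembled from the captured job IDs. The helper below is a small sketch; the name depend_attr is made up for this illustration, and it does nothing more than join a dependency type and a list of job IDs with colons, as in the example above.

```shell
#!/bin/bash
# Sketch: build the value passed to -W from a dependency type
# (e.g. afterok) followed by one or more job IDs.
depend_attr() {
    local type=$1
    shift
    local attr="depend=$type" id
    for id in "$@"; do
        attr="$attr:$id"
    done
    echo "$attr"
}
```

With FIRST and SECOND captured as in submit_1, a job that must wait for both could then be submitted with qsub -W "$(depend_attr afterok "$FIRST" "$SECOND")" third_1.sh.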
The jobs get released one after another. This can be seen by running
qstat every now and then:
[gustav@bh1 PBS]$ qstat | grep gustav
13878.bh1 third gustav 00:00:05 R bg
13879.bh1 fourth gustav 0 H bg
[gustav@bh1 PBS]$
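Instead of running qstat by hand, the check can be repeated in a loop until no jobs of the given user are listed any more. This is a sketch only: the function name wait_until_done and the QSTAT override variable are made up for this illustration, and the loop relies on nothing more than the qstat | grep idiom used above.

```shell
#!/bin/bash
# Sketch: poll qstat until it lists no lines matching the given user name.
# QSTAT is a hypothetical override hook (not part of PBS) for trying the
# loop without a PBS server; it defaults to the real qstat.
wait_until_done() {
    local user=$1
    local interval=${2:-30}          # seconds between checks
    local qstat_cmd=${QSTAT:-qstat}
    while "$qstat_cmd" | grep -q "$user"; do
        sleep "$interval"
    done
}
```

For example, wait_until_done gustav 60 would check once a minute and return when none of the jobs is left in the queue. Note that grep matches the user name anywhere on the line, exactly as the manual check does.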
Eventually everything completes and we are left with four logs in the PBS directory:
[gustav@bh1 PBS]$ cat *_out
/N/gpfs/gustav prepared and cleaned.
Directory /N/gpfs/gustav cleaned.
writing on test
writing 1000 blocks of 1048576 random integers
real 2m42.813s
user 0m40.040s
sys 0m14.760s
-rw-r--r-- 1 gustav ucs 4194304000 Sep 13 14:44 test
File /N/gpfs/gustav/test generated.
reading test
reading in chunks of size 16777216 bytes
allocated 16777216 bytes to junk
read 4194304000 bytes
real 2m51.521s
user 0m0.000s
sys 0m12.600s
File /N/gpfs/gustav/test processed.
[gustav@bh1 PBS]$
Observe that, going by the timings above, the IO on writing is about 25 MB/s (4194304000 bytes in 2m42.813s) and only about 23 MB/s on reading (the same number of bytes in 2m51.521s), taking 1 MB as 1048576 bytes. This illustrates yet again how much IO can vary depending on the system load and configuration.
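The transfer rates can be worked out from the file size and the "real" times in the log. The function below is a sketch; the name mb_per_sec is made up for this illustration, and it takes 1 MB to mean 1048576 bytes.

```shell
#!/bin/bash
# Sketch: transfer rate in MB/s (1 MB = 1048576 bytes), to one decimal place.
# Arguments: number of bytes, elapsed time in seconds.
mb_per_sec() {
    LC_ALL=C awk -v bytes="$1" -v secs="$2" \
        'BEGIN { printf "%.1f\n", bytes / secs / 1048576 }'
}
```

With the times converted to seconds, mb_per_sec 4194304000 162.813 gives 24.6 for the write and mb_per_sec 4194304000 171.521 gives 23.3 for the read.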