
   
Submitting Jobs From Within Jobs

In this section we are going to split the job discussed in section 4.3.4 into four separate jobs. The first job will prepare the GPFS directory and, having finished its task, submit the second job. The second job will then generate the data file and submit the third job. The third job will process the data file and submit the fourth job, which will clean up and end the sequence. The jobs are constructed to run on the IUPUI cluster, avidd-i.iu.edu.

Here is what the first job script looks like:

[gustav@ih1 PBS]$ cat first.sh
#PBS -S /bin/bash
#PBS -N first
#PBS -o first_out
#PBS -e first_err
#PBS -q bg
#
# first.sh
#
# Prepare a directory on the AVIDD GPFS.
[ -d /N/gpfs/gustav ] || mkdir /N/gpfs/gustav
cd /N/gpfs/gustav
rm -f test
echo "/N/gpfs/gustav prepared and cleaned."
# Now submit second.sh.
ssh ih1 "cd PBS; /usr/pbs/bin/qsub second.sh"
echo "second.sh submitted."
# Exit cleanly.
exit 0
[gustav@ih1 PBS]$
The new element in this job is the line:
ssh ih1 "cd PBS; /usr/pbs/bin/qsub second.sh"
Remember that the job will not run on the head node. It will run on a computational node. But PBS on the AVIDD cluster is configured so that you cannot submit jobs from the computational nodes. So here we have to execute qsub as a remote command on the IUPUI head node ih1, using the secure shell, since this is the only remote execution shell supported on the cluster.

The first command passed to ssh is ``cd PBS''. Once the connection is made, the secure shell lands me in my home directory. But I don't want to submit the job from there, because then the job output and error files would be generated in my home directory too. Instead I want all output and error files to be written to my ~/PBS subdirectory, so we go to ~/PBS first.
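Incidentally, a slightly more defensive form of the same line would use && instead of the semicolon, so that qsub is not executed at all if the cd should fail for any reason. This variation is mine, not part of the original script:

ssh ih1 "cd PBS && /usr/pbs/bin/qsub second.sh"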

Then we submit the job. Observe that I use the full path name of the qsub command. The default bash configuration on the AVIDD cluster is such that the remote shell cannot find qsub otherwise. I could, of course, fix this by tweaking my own environment (the PATH should be defined in .bashrc, not in .bash_profile, so that non-interactive remote shells pick it up), but it is good practice to specify the full path of the command in this context anyway.
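If you want to check what the full path is, or to make qsub visible to remote shells after all, something along these lines should work. This is a sketch of my own, not taken from the AVIDD documentation:

# On the head node, find out where qsub lives:
which qsub             # prints /usr/pbs/bin/qsub on the AVIDD cluster

# Alternatively, extend the PATH in ~/.bashrc (not in ~/.bash_profile),
# because bash reads ~/.bashrc for non-interactive shells started by sshd:
export PATH=$PATH:/usr/pbs/bin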

The script second.sh submitted by first.sh looks as follows:

[gustav@ih1 PBS]$ cat second.sh
#PBS -S /bin/bash
#PBS -N second
#PBS -o second_out
#PBS -e second_err
#PBS -q bg
#PBS -j oe
#
# second.sh
#
# The AVIDD GPFS directory should have been prepared by first.sh.
# Generate the data file.
cd /N/gpfs/gustav
time mkrandfile -f test -l 1000
ls -l test
echo "File /N/gpfs/gustav/test generated."
# Now submit third.sh.
ssh ih1 "cd PBS; /usr/pbs/bin/qsub third.sh"
echo "third.sh submitted."
# Exit cleanly.
exit 0
[gustav@ih1 PBS]$
There is only one element in this script that you haven't seen yet. I am using a new PBS directive:
#PBS -j oe
This directive merges standard error and standard output and writes both to the standard output file. If we used
#PBS -j eo
the two streams would be merged too, but written to the standard error file instead.

The reason I want both streams merged in this case is that the UNIX command time writes its diagnostics, i.e., the amount of CPU and wall clock time used by the program, to standard error. But I want this written together with the length of the generated file, which ls -l prints on standard output, so that I can check the I/O.
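If for some reason you did not want to merge the whole job's streams, the same effect could be achieved for the timed command alone inside the script. The following is only a sketch of an alternative, not what second.sh actually does:

# time is a bash keyword and writes its report to the shell's standard
# error; redirecting the group sends the report, together with anything
# the program itself writes to standard error, to standard output:
{ time mkrandfile -f test -l 1000 ; } 2>&1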

After this script has finished generating the file, it will submit the third script, called third.sh. Here is what the third script looks like:

[gustav@ih1 PBS]$ cat third.sh
#PBS -S /bin/bash
#PBS -N third
#PBS -o third_out
#PBS -e third_err
#PBS -q bg
#PBS -j oe
#
# third.sh
#
# Process the data file generated by second.sh.
cd /N/gpfs/gustav
time xrandfile -f test -l 4
echo "File /N/gpfs/gustav/test processed."
# Submit fourth.sh.
ssh ih1 "cd PBS; /usr/pbs/bin/qsub fourth.sh"
echo "fourth.sh submitted."
# Exit cleanly.
exit 0
[gustav@ih1 PBS]$
Here I have also requested that the standard error and standard output streams be merged.

And finally the last, fourth script, which is called fourth.sh:

[gustav@ih1 PBS]$ cat fourth.sh
#PBS -S /bin/bash
#PBS -N fourth
#PBS -o fourth_out
#PBS -e fourth_err
#PBS -q bg
#
# fourth.sh
#
# Clean up everything in the GPFS directory
cd /N/gpfs/gustav
rm -f test
echo "Directory /N/gpfs/gustav cleaned."
exit 0
[gustav@ih1 PBS]$

Here is how to run the whole sequence. You submit it on the IUPUI head node ih1 by submitting just the first of the four scripts; the rest takes care of itself:

[gustav@ih1 PBS]$ qsub first.sh
13658.ih1.avidd.iu.edu
[gustav@ih1 PBS]$ while sleep 10
> do
>    qstat | grep gustav
> done
13659.ih1        second           gustav                  0 R bg              
...
13659.ih1        second           gustav           00:00:26 R bg              
...
13659.ih1        second           gustav           00:00:46 R bg              
...
13675.ih1        third            gustav                  0 R bg              
...
^C
[gustav@ih1 PBS]$ ls
Makefile  first_err  fourth_err  nodes.sh    second_out  third_out
bc.sh     first_out  fourth_out  process.sh  simple.sh   xterm.sh
first.sh  fourth.sh  job.sh      second.sh   third.sh
[gustav@ih1 PBS]$ cat first_out
/N/gpfs/gustav prepared and cleaned.
13659.ih1.avidd.iu.edu
second.sh submitted.
[gustav@ih1 PBS]$ cat second_out
writing on test
writing 1000 blocks of 1048576 random integers

real    5m8.000s
user    0m39.280s
sys     0m17.240s
-rw-r--r--    1 gustav   ucs      4194304000 Sep 13 13:28 test
File /N/gpfs/gustav/test generated.
13675.ih1.avidd.iu.edu
third.sh submitted.
[gustav@ih1 PBS]$ cat third_out
reading test
reading in chunks of size 16777216 bytes
allocated 16777216 bytes to junk
read 4194304000 bytes

real    0m42.039s
user    0m0.020s
sys     0m10.730s
File /N/gpfs/gustav/test processed.
13678.ih1.avidd.iu.edu
fourth.sh submitted.
[gustav@ih1 PBS]$ cat fourth_out
Directory /N/gpfs/gustav cleaned.
[gustav@ih1 PBS]$
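Instead of interrupting the polling loop by hand with ^C, you could let it terminate by itself once none of my jobs remain in the queue. This is a sketch only; it relies on the fact that each job in the chain submits its successor before it exits, so that qstat never comes up empty in between:

while qstat | grep -q gustav
do
    sleep 10
done
echo "All jobs have left the queue."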
Observe that the I/O is better on the computational nodes than on the head node. The reading program xrandfile, which does very little computation (the user CPU time is only 0.02s), achieves a transfer rate of about 95 MB/s (remember that $1\,\textrm{MB} = 2^{20}\,\textrm{B} = 1048576\,\textrm{B}$).
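This figure follows directly from the numbers printed in third_out: the file is 4194304000 bytes long, i.e., $4194304000 / 1048576 = 4000$ MB, and it was read in 42.039 seconds of wall clock time, which gives $4000 / 42.039 \approx 95$ MB/s.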

Why do things this way?

If you have a very long job that can be divided into separate tasks, it is usually a good idea to do so. If something goes wrong and the system crashes, or has to be taken down for maintenance, you won't lose the whole lot. In fact, the jobs may well run without any problems at all, and the maintenance window will simply slide in between them. Furthermore, if the system restricts the wall clock time or CPU time that a PBS job may consume (wall clock time restrictions make more sense in this context than CPU time restrictions - can you figure out why?), you may not be able to fit everything into a single job.

