In this section we are going to split the job discussed in section 4.3.4 into four separate jobs. The first job will prepare the GPFS directory and, having finished its task, will submit the second job. The second job will generate the data file and then submit the third job. The third job will process the data file and then submit the fourth job, which will clean up and end the sequence. The jobs are constructed to run on the IUPUI cluster, avidd-i.iu.edu.
Here is what the first job script looks like:

    [gustav@ih1 PBS]$ cat first.sh
    #PBS -S /bin/bash
    #PBS -N first
    #PBS -o first_out
    #PBS -e first_err
    #PBS -q bg
    #
    # first.sh
    #
    # Prepare a directory on the AVIDD GPFS.
    [ -d /N/gpfs/gustav ] || mkdir /N/gpfs/gustav
    cd /N/gpfs/gustav
    rm -f test
    echo "/N/gpfs/gustav prepared and cleaned."
    # Now submit second.sh.
    ssh ih1 "cd PBS; /usr/pbs/bin/qsub second.sh"
    echo "second.sh submitted."
    # Exit cleanly.
    exit 0
    [gustav@ih1 PBS]$

The new element in this job is the line:

    ssh ih1 "cd PBS; /usr/pbs/bin/qsub second.sh"

Remember that the job will not run on the head node. It will run on a computational node. But the PBS on the AVIDD cluster is configured so that you cannot submit jobs from computational nodes. So here we have to execute qsub as a remote command on the IUPUI head node ih1 by using the secure shell, since this is the only remote execution shell supported on the cluster.
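
The remote submission can also be wrapped in a small shell function, so that every script in the chain calls the same helper. What follows is only a sketch: the names remote_qsub_command, submit_next and DRY_RUN are my own, hypothetical choices, while the host name, the directory and the qsub path are the ones used in first.sh. When DRY_RUN is set to a non-empty value, the ssh command is printed instead of executed, so the function can be tried away from the cluster.

```shell
# Values taken from the scripts in this section; adjust for another site.
HEAD_NODE=ih1
SUBMIT_DIR=PBS
QSUB=/usr/pbs/bin/qsub

# Print the command that the head node should execute for a given script.
remote_qsub_command() {
    printf 'cd %s; %s %s\n' "$SUBMIT_DIR" "$QSUB" "$1"
}

# Submit a script through the head node (hypothetical helper).
# With DRY_RUN set to a non-empty value the ssh command is only printed.
submit_next() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf 'ssh %s "%s"\n' "$HEAD_NODE" "$(remote_qsub_command "$1")"
    else
        ssh "$HEAD_NODE" "$(remote_qsub_command "$1")"
    fi
}
```

Inside a job script the call would then be submit_next second.sh. With DRY_RUN=1 the same call should print the ssh line shown in first.sh.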
The first command passed to ssh is "cd PBS".
Once the connection is made,
the secure shell lands me in my home directory. But I don't want to submit
the job from there, because then the job's output and error files would
be generated in my home directory too. Instead I want all output
and error files to be written to my ~/PBS subdirectory, so we
go to ~/PBS first.
Then we submit the job. Observe that I use
the full path name of the qsub command. The default bash
configuration on the AVIDD cluster is such that the remote shell
cannot find qsub otherwise. I could, of course, fix this
by tweaking my own environment (the PATH should normally
be defined in .bashrc, not in .bash_profile),
but it is good practice
to specify the full path of the command in this context anyway.
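
If you prefer to adjust the environment instead, a fragment along the following lines could go into ~/.bashrc. This is a sketch under assumptions: the function name append_to_path is hypothetical, and the directory /usr/pbs/bin is simply the one that appears in the qsub path above. The function adds the directory only if it is not already on the PATH, so the file can be read more than once without piling up duplicates.

```shell
# Directory that holds qsub on this cluster, taken from the scripts above.
PBS_BIN=/usr/pbs/bin

# Append a directory to PATH only if it is not already there
# (hypothetical helper name).
append_to_path() {
    case ":$PATH:" in
        *":$1:"*) ;;               # already present, nothing to do
        *) PATH="$PATH:$1" ;;      # otherwise add it at the end
    esac
}

append_to_path "$PBS_BIN"
export PATH
```

Whether a remote, non-interactive shell actually reads this file depends on how bash and ssh are set up on the system, which is one more reason to keep the full path in the job scripts.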
The script second.sh submitted by first.sh looks as follows:

    [gustav@ih1 PBS]$ cat second.sh
    #PBS -S /bin/bash
    #PBS -N second
    #PBS -o second_out
    #PBS -e second_err
    #PBS -q bg
    #PBS -j oe
    #
    # second.sh
    #
    # The AVIDD GPFS directory should have been prepared by first.sh.
    # Generate the data file.
    cd /N/gpfs/gustav
    time mkrandfile -f test -l 1000
    ls -l test
    echo "File /N/gpfs/gustav/test generated."
    # Now submit third.sh.
    ssh ih1 "cd PBS; /usr/pbs/bin/qsub third.sh"
    echo "third.sh submitted."
    # Exit cleanly.
    exit 0
    [gustav@ih1 PBS]$

There is only one novelty in this script, which you haven't seen yet. I am using a new PBS directive:

    #PBS -j oe

This directive merges the standard error and standard output and writes both on the standard output file. If we were to use

    #PBS -j eo

the two streams would be merged too, and the output would be written on the standard error file instead.
The reason I want both streams merged in this case is that the
UNIX command time writes its diagnostics,
i.e., the amount of CPU and wall clock time used by the program,
on standard error. But I want this to be written together with the
length of the file, which is reported on standard output, in case I want to
check the IO.
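
The effect of merging can be tried in an ordinary shell, without PBS. The sketch below uses a stand-in function, timed_step (a hypothetical name), which writes one line to standard output and one line to standard error, the way a program run under time does. The redirection 2>&1 plays the role that the -j oe directive plays for a PBS job.

```shell
# Stand-in for a timed program: results go to standard output,
# diagnostics (like the report printed by time) go to standard error.
timed_step() {
    echo "result line"
    echo "diagnostic line" >&2
}

# Separate streams, as with -o and -e alone.
separate_out=$(timed_step 2>/dev/null)       # standard output only
separate_err=$(timed_step 2>&1 >/dev/null)   # standard error only

# Merged streams, the shell analogue of "#PBS -j oe".
merged=$(timed_step 2>&1)

printf 'stdout only: %s\n' "$separate_out"
printf 'stderr only: %s\n' "$separate_err"
printf 'merged:\n%s\n' "$merged"
```

In the merged case both lines arrive in the same place, in the order in which they were written, which is what we want for second_out.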
After this script has finished generating the file, it will submit
the third script, called third.sh. Here is what the third
script looks like:

    [gustav@ih1 PBS]$ cat third.sh
    #PBS -S /bin/bash
    #PBS -N third
    #PBS -o third_out
    #PBS -e third_err
    #PBS -q bg
    #PBS -j oe
    #
    # third.sh
    #
    # Process the data file generated by second.sh.
    cd /N/gpfs/gustav
    time xrandfile -f test -l 4
    echo "File /N/gpfs/gustav/test processed."
    # Submit fourth.sh.
    ssh ih1 "cd PBS; /usr/pbs/bin/qsub fourth.sh"
    echo "fourth.sh submitted."
    # Exit cleanly.
    exit 0
    [gustav@ih1 PBS]$

Here I have also requested that standard error and standard output streams be merged.
And finally the last, fourth script, which is called fourth.sh:

    [gustav@ih1 PBS]$ cat fourth.sh
    #PBS -S /bin/bash
    #PBS -N fourth
    #PBS -o fourth_out
    #PBS -e fourth_err
    #PBS -q bg
    #
    # fourth.sh
    #
    # Clean up everything in the GPFS directory
    cd /N/gpfs/gustav
    rm -f test
    echo "Directory /N/gpfs/gustav cleaned."
    exit 0
    [gustav@ih1 PBS]$

Here is how to work all this. You submit the whole sequence on the
IUPUI head node ih1 by submitting just the first of the four
scripts. The rest takes care of itself:

    [gustav@ih1 PBS]$ qsub first.sh
    13658.ih1.avidd.iu.edu
    [gustav@ih1 PBS]$ while sleep 10
    > do
    > qstat | grep gustav
    > done
    13659.ih1   second   gustav          0 R bg
    ...
    13659.ih1   second   gustav   00:00:26 R bg
    ...
    13659.ih1   second   gustav   00:00:46 R bg
    ...
    13675.ih1   third    gustav          0 R bg
    ...
    ^C
    [gustav@ih1 PBS]$ ls
    Makefile  first_err  fourth_err  nodes.sh    second_out  third_out
    bc.sh     first_out  fourth_out  process.sh  simple.sh   xterm.sh
    first.sh  fourth.sh  job.sh      second.sh   third.sh
    [gustav@ih1 PBS]$ cat first_out
    /N/gpfs/gustav prepared and cleaned.
    13659.ih1.avidd.iu.edu
    second.sh submitted.
    [gustav@ih1 PBS]$ cat second_out
    writing on test
    writing 1000 blocks of 1048576 random integers

    real    5m8.000s
    user    0m39.280s
    sys     0m17.240s
    -rw-r--r--    1 gustav   ucs      4194304000 Sep 13 13:28 test
    File /N/gpfs/gustav/test generated.
    13675.ih1.avidd.iu.edu
    third.sh submitted.
    [gustav@ih1 PBS]$ cat third_out
    reading test
    reading in chunks of size 16777216 bytes
    allocated 16777216 bytes to junk
    read 4194304000 bytes

    real    0m42.039s
    user    0m0.020s
    sys     0m10.730s
    File /N/gpfs/gustav/test processed.
    13678.ih1.avidd.iu.edu
    fourth.sh submitted.
    [gustav@ih1 PBS]$ cat fourth_out
    Directory /N/gpfs/gustav cleaned.
    [gustav@ih1 PBS]$

Observe that IO is better on the computational nodes than on the head node. The reading program
xrandfile, which does very little
computation (the user CPU time is only 0.02s), returns
a transfer rate of 95 MB/s (remember that
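
The quoted rate can be checked from the numbers in third_out. The sketch below assumes that 1 MB means 1048576 bytes, which is the convention that reproduces the figure of 95 MB/s; the function name transfer_rate is hypothetical. The same calculation is applied to the writing job, using the wall clock time from second_out.

```shell
# Transfer rate in MB/s, taking 1 MB = 1048576 bytes (an assumption).
# Usage: transfer_rate BYTES SECONDS
transfer_rate() {
    awk -v bytes="$1" -v seconds="$2" \
        'BEGIN { printf "%.1f\n", bytes / 1048576 / seconds }'
}

# Reading: 4194304000 bytes in 42.039 s of wall clock time (third_out).
transfer_rate 4194304000 42.039   # prints 95.1

# Writing: the same file in 5m8.000s, i.e., 308 s (second_out).
transfer_rate 4194304000 308      # prints 13.0
```

Keep in mind that the writing time also includes the generation of the random integers (about 39 s of user CPU time), so the second figure is not a pure IO measurement.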
Why do things this way?
If you have a very long job that can be divided into several tasks that can run separately, it is usually a good idea to do so. If something goes wrong and the system crashes, or has to be taken down for maintenance, you won't lose the whole lot. In fact, the jobs may simply run without any problems at all, with the maintenance window sliding in between them. Furthermore, if the system is configured so that there are restrictions on the wall clock time or CPU time consumed by PBS jobs (wall clock time restrictions make more sense in this context than CPU time restrictions; can you figure out why?), you may not be able to fit everything into a single job.
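
The scripts in this section submit the next job unconditionally. If you want the chain to stop at the stage that failed, so that you can fix the problem and resubmit from there, one possible variation is to submit the next script only when the current stage exits with status zero. The following is a sketch, not the method used above: the names submit and run_stage_then_submit are hypothetical, and submit merely echoes here, standing in for the ssh and qsub line, so that the example can be tried anywhere.

```shell
# Hypothetical submit command. On the cluster this would be the ssh/qsub
# line from the scripts above; here it only echoes.
submit() {
    echo "would submit $1"
}

# Run one stage; submit the next script only if the stage succeeded.
# Usage: run_stage_then_submit NEXT_SCRIPT COMMAND [ARGS...]
run_stage_then_submit() {
    next_script=$1
    shift
    if "$@"; then
        submit "$next_script"
    else
        stage_status=$?   # exit status of the failed stage command
        echo "stage failed (status $stage_status); $next_script not submitted" >&2
        return "$stage_status"
    fi
}
```

In second.sh, for example, the stage could then be written as run_stage_then_submit third.sh mkrandfile -f test -l 1000, with the time command applied as before if the timing is wanted.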