This program is very similar to mkrandfiles. It begins in
the usual MPI-ish way, i.e., all processes find out the size of
the pool and their own rank number within it. Then the process of
rank 0 assumes the mastership (which usually just means more work
and not much more pay) and reads the command line.
On having detected a command line input error, the master process calls MPI_Abort:
if (input_error) MPI_Abort(MPI_COMM_WORLD, 1);
and this takes every other process down. We can quit the program in such an abrupt manner in this place because we haven't opened any files yet. This is important. Otherwise, we should really postpone aborting and clean up the mess first. But in this first MPI-IO example of ours we are not going to be particularly fastidious about error handling. This will come later.
If there are no problems with the command line, the master process broadcasts (1) the number of blocks of random integers each process is going to contribute to the file, (2) the length of the file name, which includes also the space for the string termination character.
Having received the latter,
each process, with the exception of the master process, calls
malloc to allocate enough space for the string. Observe that
the master process never had to malloc space for the string
explicitly. This was done by function getopt internally,
when it created the string optarg. Then the master process
merely made its own instance of filename point to the
same location to which optarg pointed.
Finally, the master process broadcasts (3) the name of the file to other processes.
In the next part of the code:
number_of_integers = number_of_blocks * BLOCK_SIZE;
number_of_bytes = sizeof(int) * number_of_integers;
each process calculates the number of random integers it is going to write and the number of bytes it will need to store all these integers in its memory. Here this program differs a little from
mkrandfiles. Instead of writing in
numerous small chunks, we are going to prepare all the numbers in memory
first, and then write the whole lot in a single operation.
We will have to allocate sufficient space for the numbers, and
so number_of_bytes will become an argument to malloc.
This argument is of type size_t, which on a 32-bit system such as the
IA32 is a 32-bit unsigned integer, so you cannot malloc
a long long of bytes there. Here number_of_bytes is a plain int,
so it must not exceed INT_MAX.
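The limit on number_of_bytes can be checked before calling malloc. What follows is only a minimal sketch, assuming that number_of_bytes is a plain int as described above. The helper name bytes_fit_in_int and its block_size parameter are hypothetical and do not appear in the original program. The count of integers is computed in long long, so that the check itself cannot overflow.

```c
#include <limits.h>

/* Hypothetical helper, not part of the original program.
 * Returns 1 if number_of_blocks * block_size integers occupy a byte
 * count that still fits in an int, and 0 otherwise. */
int bytes_fit_in_int(int number_of_blocks, int block_size)
{
    long long integers;

    if (number_of_blocks < 0 || block_size < 0)
        return 0;
    /* Widen before multiplying so the product is computed in 64 bits. */
    integers = (long long) number_of_blocks * (long long) block_size;
    /* Compare against the largest integer count whose size in bytes
     * does not exceed INT_MAX. */
    return integers <= (long long) INT_MAX / (long long) sizeof(int);
}
```

With 4-byte integers and a hypothetical block size of 1048576, a request for 256 blocks (1 GB) passes this check, whereas 512 blocks (2 GB) does not.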
But the next three numbers that are computed here:
total_number_of_integers =
(long long) pool_size * (long long) number_of_integers;
total_number_of_bytes =
(long long) pool_size * (long long) number_of_bytes;
my_offset = (long long) my_rank * (long long) number_of_bytes;
are all of type long long, which on the IA32 is a 64-bit
integer. This is because the total number of bytes written
to the file may well exceed the total amount of memory available
to a single process, on account of there being many processes in
the pool.
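The casts in front of each operand matter. Without them the product would be computed in int and could overflow before it is ever stored in a long long. The sketch below isolates the offset computation in a hypothetical helper, byte_offset, which is not part of the original program, so that the arithmetic can be checked on its own.

```c
/* Hypothetical helper mirroring the my_offset computation in the text. */
long long byte_offset(int my_rank, int number_of_bytes)
{
    /* Cast each operand first: the multiplication is then carried out
     * in long long and does not wrap around at 2 GB. */
    return (long long) my_rank * (long long) number_of_bytes;
}
```

For example, byte_offset(3, 1073741824) yields 3221225472, which is already beyond what a 32-bit signed int can hold.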
The last number, my_offset, will be used to point to
a location in
our MPI file, which, in general, is going to be longer than
2 GB. The variables my_offset and its sibling my_current_offset
are of type:
MPI_Offset my_offset, my_current_offset;
You have to look up the meaning of this type in
/N/hpc/mpich2/include/mpi.h in order to find out that it is
long long on the AVIDD system. It doesn't always have to be
long long though. It may well be long or just int,
depending on how MPI was compiled and what machine it runs on.
Usually you should be able to just refer to this type knowing only that it is an integer of some opaque length. And so we could write:
total_number_of_integers =
(MPI_Offset) pool_size * (MPI_Offset) number_of_integers;
total_number_of_bytes =
(MPI_Offset) pool_size * (MPI_Offset) number_of_bytes;
my_offset = (MPI_Offset) my_rank * (MPI_Offset) number_of_bytes;
But if MPI_Offset is not long enough, you will not be
able to generate truly large MPI files. And if you don't know what
it is, you may have problems writing values of file pointer offsets
on standard output, be it for debugging or for other purposes, although
there is a macro defined in mpio.h, which is included in mpi.h:
#define LL %lld
and you may be able to use it in calls to
printf.
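One way to sidestep the question of what MPI_Offset really is when printing is to cast the value to long long at the call site and always use %lld. This is a sketch only: it assumes that MPI_Offset is an integer type no wider than long long, and the helper format_offset is hypothetical, not something provided by MPI or by the original program.

```c
#include <stdio.h>

/* Hypothetical helper: writes "rank: my current offset is N" into buf.
 * The caller is assumed to pass the offset already cast to long long,
 * e.g. (long long) my_current_offset, so that %lld always matches.
 * Returns what snprintf returns, i.e. the length of the full message. */
int format_offset(char *buf, size_t size, int rank, long long offset)
{
    return snprintf(buf, size, "%d: my current offset is %lld", rank, offset);
}
```

The same cast works directly in a printf call, of course; the helper merely makes the formatting easy to test in isolation.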
Now we encounter the first MPI-IO function, MPI_File_open:
MPI_File_open(MPI_COMM_WORLD, filename,
MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
This function opens a file identified by filename on all processes
in the MPI_COMM_WORLD communicator. This is a collective
function meaning that all processes must do it together and the
values of all parameters passed to it must be identical on
all processes too.
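The third parameter specifies how the file should be opened, e.g., for writing, reading or both, and whether it should be created if it doesn't exist. The modes supported by MPI-2 are as follows:
MPI_MODE_RDONLY: read only
MPI_MODE_RDWR: reading and writing
MPI_MODE_WRONLY: write only
MPI_MODE_CREATE: create the file if it does not exist
MPI_MODE_EXCL: error if creating a file that already exists
MPI_MODE_DELETE_ON_CLOSE: delete the file on close
MPI_MODE_UNIQUE_OPEN: the file will not be concurrently opened elsewhere
MPI_MODE_SEQUENTIAL: the file will only be accessed sequentially
MPI_MODE_APPEND: set the initial position of all file pointers to the end of the file
The modes can be combined with the bitwise-or, |, operator.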
The fourth parameter can be used to give the operating system additional
hints about how and where the file should be opened. For example,
our current version of HPSS lets SP users open HPSS files using MPI-IO.
If you wanted to tell HPSS which class of service the file should be
associated with, what annotation string it should have attached to
its HPSS data base record, and what ACLs it should have,
you would use the info structure
to pass all this information. In this case though we don't pass
any such data to GPFS and so we use one of the
predefined
infos,
which is MPI_INFO_NULL.
On successful completion MPI_File_open returns
a file handle in fh. This is not the same as
a file pointer. It is used a little differently and you always have
to check meticulously whether an MPI-IO function you call wants a
file handle (i.e., its value) or a pointer to it. Another thing
that happens is that every process in the MPI_COMM_WORLD
communicator gets its own local pointer to the file and all those
pointers point to the beginning of the file, unless the MPI_MODE_APPEND
option has been used.
Having opened the file collectively, each process is now going to advance to its own position within it by calling MPI_File_seek:
MPI_File_seek(fh, my_offset, MPI_SEEK_SET);
This is where we use the
my_offset variable, which is
of type MPI_Offset. The value of this variable is different
for each process, so that when they get to write the data, they'll
write it to different portions of the file without overwriting
each other's territory.
The seek can be performed in one of three ways:
MPI_SEEK_SET: the pointer is set to my_offset
MPI_SEEK_CUR: the pointer is advanced by my_offset from its current position
MPI_SEEK_END: the pointer is set to my_offset from the end of the file
my_offset can be positive or negative, i.e., you
can use it to move in both directions within the file.
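The three seek modes, which MPI names MPI_SEEK_SET, MPI_SEEK_CUR and MPI_SEEK_END, differ only in the reference point to which the offset is added. The following is a small model of that arithmetic in plain C, not a call into MPI: the enum and the function model_seek are hypothetical stand-ins introduced for illustration, and the real constants in mpi.h may have different values.

```c
/* Hypothetical stand-ins for MPI_SEEK_SET, MPI_SEEK_CUR and MPI_SEEK_END. */
enum seek_mode { MODEL_SEEK_SET, MODEL_SEEK_CUR, MODEL_SEEK_END };

/* Returns the position a file pointer would have after a seek, or -1
 * if the result would lie before the beginning of the file. */
long long model_seek(long long current, long long file_size,
                     long long offset, enum seek_mode whence)
{
    long long target;

    switch (whence) {
    case MODEL_SEEK_SET: target = offset;             break;
    case MODEL_SEEK_CUR: target = current + offset;   break;
    case MODEL_SEEK_END: target = file_size + offset; break;
    default:             return -1;
    }
    return target < 0 ? -1 : target;
}
```

For a pointer at position 100 in a file of 1000 bytes, an offset of -40 leads to position 60 relative to the current position and to position 960 relative to the end of the file.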
In this program we set the pointer explicitly. There are other ways to do this and we'll learn about them later. Whenever you manipulate a pointer explicitly, especially in a parallel program, you must exercise great caution and make sure that you point exactly where you want to point. Otherwise you may end up overwriting your own data. You can check where your pointers are by calling function MPI_File_get_position:
MPI_File_get_position(fh, &my_current_offset);
In this program we are going to call this function before and after the write, in order to see how each of the individual file pointers has advanced.
Having opened the file and positioned themselves within it, the processes allocate space for the random integers they are about to write and then generate them:
junk = (int*) malloc(number_of_bytes);
srand(28 + my_rank);
for (i = 0; i < number_of_integers; i++) *(junk + i) = rand();
Now we are ready to perform the write itself, and we time it too:
start = MPI_Wtime();
MPI_File_write(fh, junk, number_of_integers, MPI_INT, &status);
finish = MPI_Wtime();
io_time = finish - start;
Function MPI_File_write writes
number_of_integers objects of type MPI_INT
taken from the buffer pointed to by junk to the file whose
file handle is fh. The write, for a given process,
commences at the place where its file pointer points and the file
pointer is advanced as the writing proceeds.
We can check how many items have indeed been written by inspecting
status with function
MPI_Get_count:
MPI_Get_count(&status, MPI_INT, &count);
The variable
status is of type
MPI_Status, and it is the same status that is
returned, e.g., by
MPI_Recv.
You can see here how nicely MPI-IO fits with MPI.
After the write we call MPI_File_get_position again:
MPI_File_get_position(fh, &my_current_offset);
and the new positions are printed on standard output. Recall the following:
0: my current offset is 0
1: my current offset is 1073741824
2: my current offset is 2147483648
3: my current offset is 3221225472
4: my current offset is 4294967296
[...]
0: wrote 268435456 integers
0: my current offset is 1073741824
1: wrote 268435456 integers
1: my current offset is 2147483648
2: wrote 268435456 integers
2: my current offset is 3221225472
3: wrote 268435456 integers
3: my current offset is 4294967296
4: wrote 268435456 integers
4: my current offset is 5368709120
You can see that the initial offset of process 0 was 0 and after the write it is 1073741824. But this was the initial offset of process 1. Whereas after the write process 1 advanced to 2147483648. But this was the initial offset of process 2, which after the write advanced to 3221225472.
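The observation that each process finishes exactly where the next one started can be turned into a mechanical check. The sketch below assumes that the offsets before and after the write have already been collected into two arrays indexed by rank; the function regions_are_contiguous is hypothetical and is not part of the original program.

```c
/* Hypothetical check, not part of the original program: returns 1 if
 * the offset reported after the write by each rank equals the offset
 * reported before the write by the next rank, and 0 otherwise. */
int regions_are_contiguous(const long long *before, const long long *after,
                           int pool_size)
{
    int rank;

    for (rank = 0; rank + 1 < pool_size; rank++) {
        if (after[rank] != before[rank + 1])
            return 0;   /* a gap or an overlap between two neighbours */
    }
    return 1;
}
```

Fed with the ten offsets printed above, the function returns 1: the five regions tile the file without gaps and without overlaps.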
Now we can close the file by calling MPI_File_close:
MPI_File_close(&fh);
This function merely disposes of the file handle. Once you have closed the file, you can no longer refer to it by using
fh.
The last part of the program calls MPI_Allreduce to find the longest IO time that any process spent writing the data, and this time is then used by the master process to estimate the data transfer rate:
MPI_Allreduce(&io_time, &longest_io_time, 1, MPI_DOUBLE, MPI_MAX,
              MPI_COMM_WORLD);
if (i_am_the_master) {
   printf("longest_io_time       = %f seconds\n", longest_io_time);
   printf("total_number_of_bytes = %lld\n", total_number_of_bytes);
   printf("transfer rate         = %f MB/s\n",
          total_number_of_bytes / longest_io_time / MBYTE);
}
whereupon all processes meet at MPI_Finalize and exit.
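The longest time is used because the file is complete only when the slowest process has finished writing. The arithmetic of the estimate can be sketched as a small function. The value of MBYTE is not shown in the text, so 1048576 bytes is assumed here purely for illustration, and the function name transfer_rate_mb_per_s is hypothetical.

```c
/* Assumption: the original program's MBYTE constant is not shown in the
 * text; 1048576 (2^20 bytes) is used here for illustration only. */
#define MBYTE 1048576.0

/* Hypothetical helper: aggregate transfer rate in MB/s, based on the
 * total number of bytes written and the longest per-process IO time. */
double transfer_rate_mb_per_s(long long total_number_of_bytes,
                              double longest_io_time)
{
    if (longest_io_time <= 0.0)
        return 0.0;   /* avoid dividing by zero for an untimed run */
    return (double) total_number_of_bytes / longest_io_time / MBYTE;
}
```

If, say, the 5368709120 bytes of the run above had taken 10 seconds on the slowest process, this would give 512 MB/s.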