
The Discussion

This program is very similar to mkrandfiles. It begins in the usual MPI-ish way, i.e., all processes find out the size of the pool and their own rank within it. Then the process of rank 0 assumes the mastership (which usually just means more work and not much more pay) and reads the command line.

On detecting a command line input error, the master process calls MPI_Abort:

if (input_error) MPI_Abort(MPI_COMM_WORLD, 1);
and this takes every other process down. We can quit the program in such an abrupt manner at this point because we haven't opened any files yet. This is important. Otherwise we would really have to postpone aborting and clean up the mess first. But in this first MPI-IO example of ours we are not going to be particularly fastidious about error handling. That will come later.

If there are no problems with the command line, the master process broadcasts (1) the number of blocks of random integers each process is going to contribute to the file, and (2) the length of the file name, which also includes space for the string termination character.

Having received the latter, each process, with the exception of the master process, calls malloc to allocate enough space for the string. Observe that the master process never had to malloc space for the string explicitly. This was done internally by the function getopt, when it created the string optarg. The master process then merely made its own instance of filename point to the same location to which optarg pointed.

Finally, the master process broadcasts (3) the name of the file to other processes.
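A minimal sketch of these three broadcasts, assuming the variable names used in the program (number_of_blocks, filename_length, filename, my_rank, and rank 0 as the master), might look like this:

  MPI_Bcast(&number_of_blocks, 1, MPI_INT, 0, MPI_COMM_WORLD);
  MPI_Bcast(&filename_length, 1, MPI_INT, 0, MPI_COMM_WORLD);
  if (my_rank != 0)
    /* slaves must allocate the string themselves; on the master
       filename already points at getopt's optarg */
    filename = (char*) malloc(filename_length);
  MPI_Bcast(filename, filename_length, MPI_CHAR, 0, MPI_COMM_WORLD);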

In the next part of the code:

  number_of_integers = number_of_blocks * BLOCK_SIZE;
  number_of_bytes = sizeof(int) * number_of_integers;
each process calculates the number of random integers it is going to write and the number of bytes it will need to store all these integers in its memory. Here this program differs a little from mkrandfiles. Instead of writing in numerous small chunks, we are going to prepare all the numbers in memory first, and then write the whole lot in a single operation. We will have to allocate sufficient space for the numbers, and so number_of_bytes will become an argument to malloc. This argument is of type size_t, which on the IA32 is a 32-bit integer. You cannot malloc a long long of bytes there, because a 32-bit UNIX process cannot address memory above INT_MAX.
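The program itself assumes that number_of_bytes fits. A sketch of a guard one could add before any memory is allocated or files opened, using INT_MAX from limits.h (this check is not in the original program):

  /* refuse block counts whose per-process byte total overflows an int */
  if ((long long) number_of_blocks * BLOCK_SIZE * (long long) sizeof(int)
      > (long long) INT_MAX)
    MPI_Abort(MPI_COMM_WORLD, 1);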

But the next three numbers that are computed here:

  total_number_of_integers =
    (long long) pool_size * (long long) number_of_integers;
  total_number_of_bytes =
    (long long) pool_size * (long long) number_of_bytes;
  my_offset = (long long) my_rank * (long long) number_of_bytes;
are all of type long long, which on the IA32 is a 64-bit integer. This is because the total number of bytes written on the file may well exceed the total amount of memory available to a single process, on account of there being many processes in the pool.

The last number, my_offset, will be used to point to a location in our MPI file, which, in general, is going to be longer than 2 GB. The variable my_offset and its sibling my_current_offset are of type:

  MPI_Offset my_offset, my_current_offset;
You have to look up the meaning of this type in /N/hpc/mpich2/include/mpi.h in order to find out that it is long long on the AVIDD system. It doesn't always have to be long long, though. It may well be long or just int, depending on how MPI was compiled and what machine it runs on.

Usually you should be able to just refer to this type knowing only that it is an integer of some opaque length. And so we could write:

  total_number_of_integers =
    (MPI_Offset) pool_size * (MPI_Offset) number_of_integers;
  total_number_of_bytes =
    (MPI_Offset) pool_size * (MPI_Offset) number_of_bytes;
  my_offset = (MPI_Offset) my_rank * (MPI_Offset) number_of_bytes;
But if MPI_Offset is not long enough, you will not be able to generate truly large MPI files. And if you don't know what it is, you may have problems printing the values of file pointer offsets on standard output, be it for debugging or for other purposes, although there is a macro defined in mpio.h, which is included by mpi.h:
#define LL %lld
and you may be able to use it in calls to printf.
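A more portable trick is to cast the offset to long long explicitly and print it with %lld, for example (using the program's own variables):

  printf("%d: my current offset is %lld\n",
         my_rank, (long long) my_current_offset);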

Now we encounter the first MPI-IO function, MPI_File_open:

  MPI_File_open(MPI_COMM_WORLD, filename, 
                MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);
This function opens a file identified by filename on all processes in the MPI_COMM_WORLD communicator. This is a collective function, meaning that all processes must call it together and the values of the parameters passed to it must be identical on all processes too.

The third parameter specifies how the file should be opened, e.g., for writing, reading or both, and whether it should be created if it doesn't exist. The modes supported by MPI-2 are as follows:

MPI_MODE_RDONLY
open the file for reading from it only
MPI_MODE_RDWR
open the file for reading and writing
MPI_MODE_WRONLY
open the file for writing to it only
MPI_MODE_CREATE
create the file if it does not exist
MPI_MODE_EXCL
throw an error if you try to create a file that exists already
MPI_MODE_DELETE_ON_CLOSE
delete the file on close - such files are called scratch files and they are used for auxiliary data storage during computations
MPI_MODE_UNIQUE_OPEN
ensure that the file is not going to be opened concurrently elsewhere (e.g., by another communicator)
MPI_MODE_SEQUENTIAL
the file will be accessed sequentially only
MPI_MODE_APPEND
set initial position of all file pointers to the end of the file
The options can be combined with the bitwise ``or'' operator, |.
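For example, a scratch file that is created if it doesn't exist, opened for writing only, and deleted automatically on close, might be opened as follows (the file name scratch.dat is made up for this illustration):

  MPI_File fh;
  MPI_File_open(MPI_COMM_WORLD, "scratch.dat",
                MPI_MODE_CREATE | MPI_MODE_WRONLY | MPI_MODE_DELETE_ON_CLOSE,
                MPI_INFO_NULL, &fh);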

The fourth parameter can be used to give the operating system additional hints about how and where the file should be opened. For example, our current version of HPSS lets SP users open HPSS files using MPI-IO. If you wanted to tell HPSS which class of service the file should be associated with, what annotation string it should have attached to its HPSS data base record, and what ACLs it should have, you would use the info structure to pass all this information. In this case though we don't pass any such data to GPFS and so we use one of the predefined infos, which is MPI_INFO_NULL.
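Hints travel in an MPI_Info object. A minimal sketch, assuming the ROMIO hint striping_factor (implementations are free to understand different hints, or to ignore them altogether):

  MPI_Info info;
  MPI_Info_create(&info);
  /* striping_factor is a ROMIO hint; it is silently ignored if unsupported */
  MPI_Info_set(info, "striping_factor", "4");
  MPI_File_open(MPI_COMM_WORLD, filename,
                MPI_MODE_CREATE | MPI_MODE_RDWR, info, &fh);
  MPI_Info_free(&info);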

On successful completion MPI_File_open returns a file handle in fh. This is not the same as a file pointer. It is used a little differently and you always have to check meticulously whether an MPI-IO function you call wants a file handle (i.e., its value) or a pointer to it. Another thing that happens is that every process in the MPI_COMM_WORLD communicator gets its own local pointer to the file, and all those pointers point to the beginning of the file, unless the MPI_MODE_APPEND option has been used.

Having opened the file, collectively, each process is now going to advance to its own position within it by calling MPI_File_seek:

  MPI_File_seek(fh, my_offset, MPI_SEEK_SET);
This is where we use the my_offset variable, which is of type MPI_Offset. The value of this variable is different for each process, so that when the processes get to write their data, they'll write it on different portions of the file without overwriting each other's territory.

The seek can be performed in one of three ways:

MPI_SEEK_SET
the pointer is set exactly to my_offset
MPI_SEEK_CUR
the pointer is advanced by my_offset from its current position
MPI_SEEK_END
the pointer is set to the end of the file plus my_offset
In general my_offset can be positive or negative, i.e., you can use it to move in both directions within the file.
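For example, a process could step back over the block it has just written. A sketch, using this program's variables (this rewind is not in the original code):

  /* move back by one process-sized block from the current position */
  MPI_File_seek(fh, -(MPI_Offset) number_of_bytes, MPI_SEEK_CUR);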

In this program we set the pointer explicitly. There are other ways to do this and we'll learn about them later. Whenever you manipulate a pointer explicitly, especially in a parallel program, you must exercise great caution and make sure that you point exactly where you want to point. Otherwise you may end up overwriting your own data. You can check where your pointers are by calling the function MPI_File_get_position:

  MPI_File_get_position(fh, &my_current_offset);
In this program we are going to call this function before and after the write, in order to see how each of the individual file pointers has advanced.

Having opened the file and positioned themselves within it, the processes allocate space for the random integers they are about to write and then generate them:

  junk = (int*) malloc(number_of_bytes);  /* buffer for this process' integers */
  srand(28 + my_rank);                    /* seed the generator differently on each process */
  for (i = 0; i < number_of_integers; i++) *(junk + i) = rand();

Now we are ready to perform the write itself, and we time it too:

  start = MPI_Wtime();
  MPI_File_write(fh, junk, number_of_integers, MPI_INT, &status);
  finish = MPI_Wtime();
  io_time = finish - start;
Function MPI_File_write writes number_of_integers objects of type MPI_INT taken from the buffer pointed to by junk on a file whose file handle is fh. The write, for a given process, commences at the place where its file pointer points, and the file pointer is advanced as the writing proceeds.

We can check how many items have indeed been written by inspecting status with the function MPI_Get_count:

  MPI_Get_count(&status, MPI_INT, &count);
The variable status is of type  MPI_Status, and it is the same status that is returned, e.g., by  MPI_Recv. You can see here how nicely MPI-IO fits with MPI.
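A sketch of how one might act on the count, comparing it with what was requested (the diagnostic message is made up):

  MPI_Get_count(&status, MPI_INT, &count);
  if (count != number_of_integers)
    fprintf(stderr, "%d: short write: %d of %d integers\n",
            my_rank, count, number_of_integers);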

After the write we call MPI_File_get_position again:

  MPI_File_get_position(fh, &my_current_offset);
and the new positions are printed on standard output. Recall the following:
  0: my current offset is 0
  1: my current offset is 1073741824
  2: my current offset is 2147483648
  3: my current offset is 3221225472
  4: my current offset is 4294967296

[...]

  0: wrote 268435456 integers
  0: my current offset is 1073741824
  1: wrote 268435456 integers
  1: my current offset is 2147483648
  2: wrote 268435456 integers
  2: my current offset is 3221225472
  3: wrote 268435456 integers
  3: my current offset is 4294967296
  4: wrote 268435456 integers
  4: my current offset is 5368709120
You can see that the initial offset of process 0 was 0 and after the write it is 1073741824. But this was the initial offset of process 1, which after the write advanced to 2147483648. But this was the initial offset of process 2, which after the write advanced to 3221225472... and so on. The file has indeed been written rather tightly. Each process wrote on its own portion of it.

Now we can close the file by calling MPI_File_close:

  MPI_File_close(&fh);
This call merely disposes of the file handle. Once you have closed the file, you can no longer refer to it by using fh.

The last part of the program calls MPI_Allreduce to find the longest IO time that any process spent writing the data, and this time is then used by the master process to estimate the data transfer rate:

  MPI_Allreduce(&io_time, &longest_io_time, 1, MPI_DOUBLE, MPI_MAX,
                MPI_COMM_WORLD);

  if (i_am_the_master) {
    printf("longest_io_time       = %f seconds\n", longest_io_time);
    printf("total_number_of_bytes = %lld\n", total_number_of_bytes);
    printf("transfer rate         = %f MB/s\n", 
           total_number_of_bytes / longest_io_time / MBYTE);
  }
whereupon all processes meet at MPI_Finalize and exit.

