
The Discussion

This program begins the same way mkrandpfile did, until we get to

  file_open_error = MPI_File_open(MPI_COMM_WORLD, filename, 
                                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);

  if (file_open_error != MPI_SUCCESS) {

    char error_string[BUFSIZ];
    int length_of_error_string, error_class;

    MPI_Error_class(file_open_error, &error_class);
    MPI_Error_string(error_class, error_string, &length_of_error_string);
    printf("%3d: %s\n", my_rank, error_string);

    MPI_Error_string(file_open_error, error_string, &length_of_error_string);
    printf("%3d: %s\n", my_rank, error_string);

    MPI_Abort(MPI_COMM_WORLD, file_open_error);
  }

This time we open the file for reading only and check the value returned by MPI_File_open. If there is no problem, i.e., file_open_error == MPI_SUCCESS, we go ahead and read the file. But if there is a problem, we convert both the error class and the error code in file_open_error to error messages, print them on standard output, and call MPI_Abort.

Assuming that MPI_File_open worked, we need to find out how much data has to be read. So we check the size of the file by calling MPI_File_get_size:

  MPI_File_get_size(fh, &total_number_of_bytes);

where total_number_of_bytes must be of type MPI_Offset, which in our case is long long.

Now we evaluate how much data every process needs to read:

  number_of_bytes_ll = total_number_of_bytes / pool_size;

  /* If pool_size does not divide total_number_of_bytes evenly,
     the last process will have to read more data, i.e., to the
     end of the file. */

  max_number_of_bytes_ll = 
    number_of_bytes_ll + total_number_of_bytes % pool_size;

Depending on the length of the file and the number of processes, the division of the former by the latter may or may not be exact. If it isn't, then max_number_of_bytes_ll exceeds number_of_bytes_ll by the remainder of the division, and we make the last process read that much more. Observe that both number_of_bytes_ll and max_number_of_bytes_ll are long long. At this stage we don't know whether they will fit in an int.
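
The split described above can be tried out in isolation. The following is only a sketch: the helper names bytes_for_rank and sum_of_shares are hypothetical and do not appear in the program, and the code assumes, as the text does, that the last process absorbs the remainder.

```c
#include <assert.h>

/* Hypothetical helper, not part of the original program: number of
   bytes that process `rank` out of `pool_size` reads from a file of
   `total` bytes. Every rank gets total / pool_size bytes; the last
   rank additionally gets the remainder total % pool_size. */
long long bytes_for_rank(long long total, int pool_size, int rank)
{
  assert(pool_size > 0 && rank >= 0 && rank < pool_size);
  long long share = total / pool_size;
  if (rank == pool_size - 1)
    return share + total % pool_size;
  return share;
}

/* Hypothetical helper: adds up the shares of all ranks. If the split
   is right, the result equals `total`. */
long long sum_of_shares(long long total, int pool_size)
{
  long long sum = 0;
  for (int rank = 0; rank < pool_size; rank++)
    sum += bytes_for_rank(total, pool_size, rank);
  return sum;
}
```

For a 10-byte file and 4 processes, ranks 0 to 2 would read 2 bytes each and rank 3 would read 4, so the shares add up to 10.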

Now we have the if statement:

  if (max_number_of_bytes_ll < INT_MAX) {
     blah... blah... blah...
  }
  else {
    if (i_am_the_master) {
      printf("Not enough memory to read the file.\n");
      printf("Consider running on more nodes.\n");
    }
  } /* of if(max_number_of_bytes_ll < INT_MAX) */

This statement checks, right at the top, whether max_number_of_bytes_ll is going to fit into an int, because we are going to read the data the same way we wrote it, i.e., in one large gulp into a single sufficiently long array. If max_number_of_bytes_ll is too large, the master process prints a message and we close the file right away.
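
The effect of this test can be seen with concrete numbers. The sketch below is an illustration only: largest_share_fits_in_int is a hypothetical helper, not a function of the program, and the figures in the note that follows assume a platform with a 32-bit int.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: returns 1 if the largest per-process share
   (the one read by the last process) is below INT_MAX, 0 otherwise.
   It mirrors the test max_number_of_bytes_ll < INT_MAX. */
int largest_share_fits_in_int(long long total, int pool_size)
{
  assert(pool_size > 0);
  long long max_share = total / pool_size + total % pool_size;
  return max_share < INT_MAX;
}
```

With INT_MAX equal to 2147483647, a file of 34359738368 bytes passes the test on 32 processes, where each share is 1073741824 bytes, but fails on 16 processes, where each share would be 2147483648 bytes. This is why the message suggests running on more nodes.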

Now let's see what happens inside the top clause of the if statement.

First, each process converts its byte count to an ordinary int suitable for passing to malloc: the last process converts max_number_of_bytes_ll, and every other process converts number_of_bytes_ll. Then they all call malloc:

    if (my_rank == last_guy)
      number_of_bytes = (int) max_number_of_bytes_ll;
    else
      number_of_bytes = (int) number_of_bytes_ll;

    read_buffer = (char*) malloc(number_of_bytes);

Now every process figures out its own offset in the file and goes there:

    my_offset = (MPI_Offset) my_rank * number_of_bytes_ll;
#ifdef DEBUG
    printf("%3d: my offset = %lld\n", my_rank, my_offset);
#endif
    MPI_File_seek(fh, my_offset, MPI_SEEK_SET);

and then they all meet at the barrier.

Now we are ready to commence the read, to time it, and to find out whether and how far the file pointers have advanced as a result:

    start = MPI_Wtime();
    MPI_File_read(fh, read_buffer, number_of_bytes, MPI_BYTE, &status);
    finish = MPI_Wtime();
    MPI_Get_count(&status, MPI_BYTE, &count);
#ifdef DEBUG
    printf("%3d: read %d bytes\n", my_rank, count);
#endif
    MPI_File_get_position(fh, &my_offset);
#ifdef DEBUG
    printf("%3d: my offset = %lld\n", my_rank, my_offset);
#endif

The function MPI_File_read reads number_of_bytes items of type MPI_BYTE from the file identified by the file handle fh into read_buffer. Every process starts reading at the position it reached through the call to MPI_File_seek, and as the read progresses, its own file pointer advances accordingly.

Let us have a look at the positions of the pointers before and after the read (the listing shows the first five processes only):

  0: total_number_of_bytes = 34359738368
  0: allocated 1073741824 bytes
  0: my offset = 0
  1: total_number_of_bytes = 34359738368
  1: allocated 1073741824 bytes
  1: my offset = 1073741824
  2: total_number_of_bytes = 34359738368
  2: allocated 1073741824 bytes
  2: my offset = 2147483648
  3: total_number_of_bytes = 34359738368
  3: allocated 1073741824 bytes
  3: my offset = 3221225472
  4: total_number_of_bytes = 34359738368
  4: allocated 1073741824 bytes
  4: my offset = 4294967296


  0: read 1073741824 bytes
  0: my offset = 1073741824
  1: read 1073741824 bytes
  1: my offset = 2147483648
  2: read 1073741824 bytes
  2: my offset = 3221225472
  3: read 1073741824 bytes
  3: my offset = 4294967296
  4: read 1073741824 bytes
  4: my offset = 5368709120

Observe that process 0 started at offset 0 and progressed to offset 1073741824, having read exactly 1073741824 bytes. Process 1 started at offset 1073741824 and progressed to offset 2147483648, which is exactly where process 2 started from. In short, we have read every byte of the file without missing anything, not even the last few bytes when the length of the file is not divisible by the number of processes: the last process mops them up.
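
The claim that the per-process ranges cover the file without gaps can be checked with a few lines of arithmetic. The sketch below uses hypothetical helpers (start_offset, end_offset and ranges_cover_file are not part of the program), and the pool size of 32 used in the note that follows is an assumption inferred from dividing the reported file size by the reported allocation.

```c
#include <assert.h>

/* Hypothetical helper: offset at which `rank` starts reading,
   i.e., rank * (total / pool_size), as in the call to MPI_File_seek. */
long long start_offset(long long total, int pool_size, int rank)
{
  assert(pool_size > 0 && rank >= 0 && rank < pool_size);
  return (long long) rank * (total / pool_size);
}

/* Hypothetical helper: offset at which `rank` stops after reading
   its share; the last rank also reads the remainder. */
long long end_offset(long long total, int pool_size, int rank)
{
  long long count = total / pool_size;
  if (rank == pool_size - 1)
    count += total % pool_size;
  return start_offset(total, pool_size, rank) + count;
}

/* Returns 1 if rank 0 starts at 0, every rank ends where the next
   one starts, and the last rank ends at the end of the file. */
int ranges_cover_file(long long total, int pool_size)
{
  if (start_offset(total, pool_size, 0) != 0)
    return 0;
  for (int rank = 0; rank + 1 < pool_size; rank++)
    if (end_offset(total, pool_size, rank) !=
        start_offset(total, pool_size, rank + 1))
      return 0;
  return end_offset(total, pool_size, pool_size - 1) == total;
}
```

With a total of 34359738368 bytes and 32 processes, start_offset gives 1073741824 for rank 1 and end_offset gives 5368709120 for rank 4, which agrees with the listing. The same check also succeeds when the division is not exact, e.g., for a 10-byte file and 4 processes.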

Now we compute the bandwidth the same way we did for mkrandpfile:

    io_time = finish - start;
    MPI_Allreduce(&io_time, &longest_io_time, 1, MPI_DOUBLE, MPI_MAX,
                  MPI_COMM_WORLD);
    if (i_am_the_master) {
      printf("longest_io_time       = %f seconds\n", longest_io_time);
      printf("total_number_of_bytes = %lld\n", total_number_of_bytes);
      printf("transfer rate         = %f MB/s\n", 
             total_number_of_bytes / longest_io_time / MBYTE);
    }
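
The arithmetic behind the reported transfer rate can be sketched without MPI. In the code below, longest_time stands in for the MPI_MAX reduction, both helper names are hypothetical, and the value of MBYTE (1048576 bytes) is an assumption, since its definition is not shown in this section.

```c
#include <assert.h>

#define MBYTE 1048576  /* assumed value: 2^20 bytes */

/* Hypothetical stand-in for the MPI_MAX reduction: the longest of
   the per-process I/O times. The read as a whole is finished only
   when the slowest process is done, hence the maximum. */
double longest_time(const double *times, int n)
{
  assert(n > 0);
  double longest = times[0];
  for (int i = 1; i < n; i++)
    if (times[i] > longest)
      longest = times[i];
  return longest;
}

/* Hypothetical helper: aggregate transfer rate in MB/s, i.e., all
   bytes read by all processes divided by the longest I/O time. */
double transfer_rate(long long total, double longest_io_time)
{
  assert(longest_io_time > 0.0);
  return total / longest_io_time / MBYTE;
}
```

As a purely illustrative example with made-up timings, if the slowest process needed 64 seconds to read its share of a 34359738368-byte file, the aggregate rate would be 34359738368 / 64 / 1048576 = 512 MB/s.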

And this is it. The processes all go to MPI_Finalize and exit.

Zdzislaw Meglicki