The Discussion

The program begins with the usual incantations to MPI, which result in every process learning its rank number and the size of the pool of processes. The master process, which doesn't really do any mastering in this program but occasionally speaks for all and reads the command line, learns that it is the master.
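
In a minimal sketch these incantations amount to the fragment below; the names pool_size, my_rank, i_am_the_master and MASTER_RANK follow the ones used throughout this discussion:

    #include <mpi.h>

    #define MASTER_RANK 0

    int main(int argc, char *argv[])
    {
      int pool_size, my_rank, i_am_the_master;

      MPI_Init(&argc, &argv);                     /* every process joins MPI   */
      MPI_Comm_size(MPI_COMM_WORLD, &pool_size);  /* size of the process pool  */
      MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);    /* this process' rank number */
      i_am_the_master = (my_rank == MASTER_RANK); /* TRUE on the master only   */

      /* ... command line analysis, broadcasts and file I/O go here ... */

      MPI_Finalize();
      return 0;
    }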

The command-line analysis is done with getopt, and there is one more option here, -v, which activates verbose output by setting a boolean variable verbose to TRUE.
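
A sketch of how such a getopt loop may look is given below; only -v is named explicitly in the text, so the -f option for passing the file name is an assumption, and the fragment needs unistd.h and string.h as well as an int variable c:

    while ((c = getopt(argc, argv, "vf:")) != -1)
      switch (c) {
      case 'v':                        /* verbose output requested        */
        verbose = TRUE;
        break;
      case 'f':                        /* assumed: the file name option   */
        strcpy(file_name, optarg);
        file_name_length = strlen(file_name);
        break;
      default:                         /* unrecognized option             */
        input_error = TRUE;
      }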

Next the master process broadcasts the content of input_error to all processes, so that they can all go directly to MPI_Finalize if an input error has occurred; otherwise the action begins. The master process broadcasts verbose, file_name_length and file_name to the other processes.
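
The broadcasts may look roughly as follows; this is a sketch only, and the exact order of the calls, as well as the inclusion of the terminating NUL in the file name broadcast, are assumptions:

    MPI_Bcast(&input_error, 1, MPI_INT, MASTER_RANK, MPI_COMM_WORLD);
    if (! input_error) {
      MPI_Bcast(&verbose, 1, MPI_INT, MASTER_RANK, MPI_COMM_WORLD);
      MPI_Bcast(&file_name_length, 1, MPI_INT, MASTER_RANK, MPI_COMM_WORLD);
      MPI_Bcast(file_name, file_name_length + 1, MPI_CHAR, MASTER_RANK,
                MPI_COMM_WORLD);

      /* ... the action begins here ... */
    }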

Having received this information, all processes promptly prepare themselves for calling MPI_Type_create_darray:

    ndims = NDIMS; 
    for (i = 0; i < ndims; i++) {
      array_of_gsizes[i] = SIZE;
      array_of_distribs[i] = MPI_DISTRIBUTE_BLOCK;
      array_of_dargs[i] = MPI_DISTRIBUTE_DFLT_DARG;
      array_of_psizes[i] = 0;
    }
    MPI_Dims_create(pool_size, ndims, array_of_psizes);
    order = MPI_ORDER_C;
The number of dimensions, both of the process grid, which is yet to be constructed, and of the data array itself, is set to NDIMS, which is defined to be 3. So all the arrays used in the call to MPI_Type_create_darray are going to be of length 3. The first array, array_of_gsizes, specifies the sizes of the global data array, and we set it here to $512\times512\times512$, because this is what SIZE is defined to be. All entries in array_of_distribs are set to MPI_DISTRIBUTE_BLOCK, which means that we want block distribution, rather than cyclic distribution, of data amongst the processes. Since the block distribution request doesn't take any arguments, all entries in array_of_dargs are set to MPI_DISTRIBUTE_DFLT_DARG. Finally we set all entries of the array that specifies the process grid geometry to zeros. This is because this array is going to be configured by the call to MPI_Dims_create. This function divides a process pool into an ndims-dimensional grid of processes and returns the number of processes in each direction of the grid in array_of_psizes. The total number of processes must be such that they can indeed be organized into such a grid; otherwise this function will return an error. The order parameter is set to MPI_ORDER_C, which means that the global data matrix is going to be organized in the C-language style, i.e., row-major.
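
As an aside, the factorization performed by MPI_Dims_create can be seen in isolation with a fragment like the one below; this is an illustration only, not part of the program:

    int dims[NDIMS] = {0, 0, 0};       /* zeros mean: let MPI choose      */
    MPI_Dims_create(27, NDIMS, dims);  /* 27 processes, 3 dimensions      */
    /* dims is now {3, 3, 3}: the 3 x 3 x 3 process grid reported in the
       verbose output below */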

If the program is run in verbose mode, all processes write the values of all these parameters to standard output, for example:

  4: calling MPI_Type_create_darray with
  4:    pool_size         = 27
  4:    my_rank           = 4
  4:    ndims             = 3
  4:    array_of_gsizes   = (512, 512, 512)
  4:    array_of_distribs = (121, 121, 121)
  4:    array_of_dargs    = (-49767, -49767, -49767)
  4:    array_of_psizes   = (3, 3, 3)
  4:    order             = 56
  4:    type              = 1275069445
You have to look up /N/hpc/mpich2/include/mpi.h to see that, e.g., -49767 is indeed MPI_DISTRIBUTE_DFLT_DARG, that 121 is indeed MPI_DISTRIBUTE_BLOCK, that 56 is indeed MPI_ORDER_C, and that 1275069445 is indeed MPI_INT.

Now the program finally calls MPI_Type_create_darray and commits the newly returned MPI data type, called file_type:

    MPI_Type_create_darray(pool_size, my_rank, ndims,
                           array_of_gsizes, array_of_distribs,
                           array_of_dargs, array_of_psizes, order,
                           MPI_INT, &file_type);
    MPI_Type_commit(&file_type);
There are two important numbers that characterize this new MPI type: its extent and its size. The extent of an MPI type is the distance in bytes between the upper and the lower bound markers of the type. The size of an MPI type is the total amount of non-trivial data contained in the type, also in bytes. In other words, the size is the extent of the type minus the padding. We can extract these two numbers from the newly defined type by calling the functions MPI_Type_extent and MPI_Type_size:
    MPI_Type_extent(file_type, &file_type_extent);
    MPI_Type_size(file_type, &file_type_size);
    if (verbose) {
      printf("%3d: file_type_size   = %d\n", my_rank, file_type_size);
      printf("%3d: file_type_extent = %d\n", my_rank, file_type_extent);
    }
It is instructive to inspect the output of the program at this point:
[...]
  8: file_type_size   = 19767600
  8: file_type_extent = 536870912
  4: file_type_size   = 20000844
  4: file_type_extent = 536870912
[...]
Observe that the extent of file_type is 536,870,912 bytes, which is the size of the file:
gustav@bh1 $ ls -l /N/gpfs/gustav/darray
total 524288
-rw-r--r--    1 gustav   ucs      536870912 Oct 26 17:49 test
gustav@bh1 $
but the size of file_type differs from process to process and corresponds to the amount of data this process is going to write to the file once the view is established. This is how the processes divide the file amongst themselves, without having to manipulate their local file pointers explicitly.
But there is a price to this procedure, which makes it somewhat impractical on systems such as the AVIDD cluster. The price is that the second argument of MPI_Type_extent must be of type MPI_Aint, and MPI_Aint is defined on 32-bit systems to be simply int, because the idea here is that you ought to be able to absorb a data item of this particular type into the memory of your computer. This means that MPI_Type_create_darray, as well as MPI_Type_extent, will fail if the total amount of data you want to partition exceeds INT_MAX, i.e., 2,147,483,647 bytes. This, however, is very little data by supercomputer standards. It is in places like this that the limitations of 32-bit architectures can cause real pain. Luckily, we have a 64-bit system available to us at IU: our Research SP. There is also a small component of the AVIDD cluster at IUPUI, which comprises IA64 nodes.
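
One could guard against this explicitly before constructing the type, along the following lines; this is a sketch and not part of the original program, and it assumes that limits.h has been included:

    long long total_bytes =
      (long long) SIZE * SIZE * SIZE * sizeof(int); /* 512^3 * 4 = 536,870,912 */
    if (total_bytes > (long long) INT_MAX) {        /* INT_MAX = 2,147,483,647 */
      if (i_am_the_master)
        fprintf(stderr, "global array of %lld bytes will not fit in a 32-bit MPI_Aint\n",
                total_bytes);
      MPI_Abort(MPI_COMM_WORLD, 1);
    }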

Once the data partitioning has been returned to the processes, they can allocate the required amount of storage for the data:

    write_buffer_size = file_type_size / sizeof(int);
    write_buffer = (int*) malloc(write_buffer_size * sizeof(int));

    /* We do this in case sizeof(int) does not divide file_type_size 
       exactly. But this should not happen if we have called 
       MPI_Type_create_darray with MPI_INT as the original data
       type. */

    if (! write_buffer) {
      sprintf(message, "%3d: malloc write_buffer", my_rank);
      perror(message);
      MPI_Abort(MPI_COMM_WORLD, errno);

      /* We can still abort, because we have not opened any
         files yet. Notice that since MPI_Type_create_darray 
         will fail if SIZE^3 * sizeof(int) exceeds INT_MAX,
         because MPI_Aint on AVIDD is a 32-bit integer,
         we are rather unlikely to fail on this malloc
         anyway. */
    }

    MPI_Barrier(MPI_COMM_WORLD); 
    /* We wait here in case some procs have problems with malloc. */
Observe that we are not going to proceed with opening the file until all processes meet at the barrier, implying that none had problems with malloc. Now they fill their memory buffers with numbers:
    for (i = 0; i < write_buffer_size; i++) 
      *(write_buffer + i) = my_rank * SIZE + i;
Observe that this will result in each process filling its buffer with different numbers. We need this in order to check, towards the end of the program, that the data read back from the file is identical to the data that was written to it in the first place.

Now we open the file and immediately check for a possible problem:

    file_open_error = MPI_File_open(MPI_COMM_WORLD, file_name, 
                                    MPI_MODE_CREATE | MPI_MODE_WRONLY,
                                    MPI_INFO_NULL, &fh);
    if (file_open_error != MPI_SUCCESS) {
      MPI_Error_string(file_open_error, error_string,
                       &error_string_length);
      fprintf(stderr, "%3d: %s\n", my_rank, error_string);
      MPI_Abort(MPI_COMM_WORLD, file_open_error);

      /* It is still OK to abort, because we have failed to
         open the file. */

    }
If there is a problem on open, we can still abort the program without having to worry about clean-up. Function MPI_File_open is collective, which means that if any process has a problem opening the file, the error will be returned to all of them, and the file won't be opened.

The rest of the program is contained within the else clause of the if statement:

    else {
      blah... blah... blah...
    } /* no problem with file open */
  } /* no problem with input error */
  MPI_Finalize();
  exit(0);
}
The first thing that happens within this clause is that the master process changes the permissions on the successfully opened file from rw-rw-rw- to rw-r--r--, while the other processes wait at the barrier:
      if (i_am_the_master)
         chmod(file_name, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
      MPI_Barrier(MPI_COMM_WORLD);
This shouldn't be necessary, because MPI provides means for changing the permissions on a newly created file by means of info hints, but MPICH2 hasn't implemented this part of the standard, so we have to rely on UNIX to do this ourselves.

Now we convert our MPI type file_type into an official view of the file:

      MPI_File_set_view(fh, 0, MPI_INT, file_type, "native", MPI_INFO_NULL);
and write the data. This time we call the collective form of MPI_File_write, which is MPI_File_write_all. It works much like MPI_File_write, but this time all processes must call it, and any local errors become global errors automatically. Consequently, if any of the processes fails to write the data, all will know about it and all will enter the following clause of the if statement:
      file_write_error =
        MPI_File_write_all(fh, write_buffer, write_buffer_size, MPI_INT, 
                           &status);
      if (file_write_error != MPI_SUCCESS) {
        MPI_Error_string(file_write_error, error_string,
                         &error_string_length);
        fprintf(stderr, "%3d: %s\n", my_rank, error_string);
        MPI_File_close(&fh);
        free(write_buffer);
        if (i_am_the_master) MPI_File_delete(file_name, MPI_INFO_NULL);
      }
If there is an error, all processes close the file (this is a collective call too) and then the master process deletes it.

The rest of the program is again contained within an else clause:

      else {
         blah... blah... blah...
      } /* no problem with file write */
    } /* no problem with file open */
  } /* no input error */

  MPI_Finalize();
  exit(0);
}

The first thing that happens here is that every process inspects the status returned by MPI_File_write_all for the number of data items written to the file (it should be the same as intended, but this is yet another way to check it), and the size of the whole file. These numbers should agree with the numbers obtained from the call to MPI_Type_create_darray:

        MPI_Get_count(&status, MPI_INT, &count);
        MPI_File_get_size(fh, &file_size);
        if(verbose) {
          printf("%3d: wrote %d integers\n", my_rank, count);
          printf("%3d: file size is %lld bytes\n", my_rank, file_size);
        }
and afterwards the file gets closed:
        MPI_File_close(&fh);

The last part of the program opens the file again, but this time for reading. We allocate space for the read buffer first, check that there are no problems with malloc, and then open the file for reading only:

        read_buffer_size = write_buffer_size;
        read_buffer = (int*) malloc(read_buffer_size * sizeof(int));
        if (! read_buffer) {
          sprintf(message, "%3d: malloc read_buffer", my_rank);
          perror(message);
          MPI_Abort(MPI_COMM_WORLD, errno);

          /* We can abort, because the file has been closed and
             we haven't opened it for reading yet. */

        }

        MPI_Barrier(MPI_COMM_WORLD);
        /* We wait here in case some procs have problems with malloc. */

        MPI_File_open(MPI_COMM_WORLD, file_name, MPI_MODE_RDONLY,
                      MPI_INFO_NULL, &fh);
Next we establish the same view of the file as before and read all the data back from it; every process checks how much data has actually been read by inspecting the status, and then the file is closed again:
        MPI_File_set_view(fh, 0, MPI_INT, file_type, "native", MPI_INFO_NULL);
        MPI_File_read_all(fh, read_buffer, read_buffer_size, MPI_INT, &status);
        MPI_Get_count(&status, MPI_INT, &count);
        if (verbose)
           printf("%3d: read %d integers\n", my_rank, count);
        MPI_File_close(&fh);
Now every process compares its read_buffer with its write_buffer, which we have wisely saved, and the result of the comparison is reduced onto the variable read_error maintained by the master process:
        for (i = 0; i < read_buffer_size; i++) {
           if (*(write_buffer + i) != *(read_buffer + i)) {
              printf("%3d: data read different from data written, i = %d\n", 
                     my_rank, i);
              my_read_error = TRUE;
           }
        }
        
        MPI_Reduce (&my_read_error, &read_error, 1, MPI_INT, MPI_LOR, 
                    MASTER_RANK, MPI_COMM_WORLD);
If the read and write buffers contain identical data for all processes in the pool, the master process prints a message saying so:
        if (i_am_the_master)
           if (! read_error)
              printf("--> All data read back is correct.\n");
And this is where the action ends. All processes then meet at MPI_Finalize and exit.

