The program begins with the usual incantations to MPI, which result in every process learning about its rank number and the size of the pool of processes. The master process, which doesn't really do any mastering in this program, but occasionally speaks for all and reads the command line, learns about being a master.
The command line analysis is done with getopt and there is
one more option here, -v, which activates verbose output,
by setting a boolean variable verbose to TRUE.
Next the master program broadcasts the content of input_error to
all processes, so that they can all go directly to MPI_Finalize
if there is such, otherwise the action begins. The master process
broadcasts verbose, file_name_length and file_name
to other processes.
Having received this information all processes promptly get to prepare themselves for calling MPI_Type_create_darray
ndims = NDIMS;
for (i = 0; i < ndims; i++) {
array_of_gsizes[i] = SIZE;
array_of_distribs[i] = MPI_DISTRIBUTE_BLOCK;
array_of_dargs[i] = MPI_DISTRIBUTE_DFLT_DARG;
array_of_psizes[i] = 0;
}
MPI_Dims_create(pool_size, ndims, array_of_psizes);
order = MPI_ORDER_C;
The number of dimensions both of the process grid, which is yet to
be constructed and of the data array itself is set to NDIMS,
which is defined to be 3. So all these arrays used in the call
to MPI_Type_create_darray are going to be of length 3.
The first array, array_of_gsizes, specifies the sizes of the
global data array, and we set it here to be
SIZE is defined to be. All entries in
the array_of_distribs are set to
MPI_DISTRIBUTE_BLOCK, which means that we want block distribution,
rather than cyclic distribution of data amongst the processes.
Since the block distribution request doesn't take any arguments, all
entries in the array_of_dargs are set
to
MPI_DISTRIBUTE_DFLT_DARG. Finally we set all entries to
the array that specifies the process grid geometry to zeros. This is because
this array is going to be configured by the call
to
MPI_Dims_create.
This function divides a process pool into an ndims-dimensional
grid of processes, and returns the number of processes in each direction
of the grid on array_of_psizes. The total number of processes
must be such that they can be indeed organized into a grid. Otherwise
this function will return an error. The order parameter is
set
to MPI_ORDER_C, which means that the global data matrix
is going to be organized in the C-language style, i.e., row-major.
If the program is run in the verbose mode, all processes write the values of all these parameters on standard output, for example:
4: calling MPI_Type_create_darray with 4: pool_size = 27 4: my_rank = 4 4: ndims = 3 4: array_of_gsizes = (512, 512, 512) 4: array_of_distribs = (121, 121, 121) 4: array_of_dargs = (-49767, -49767, -49767) 4: array_of_psizes = (3, 3, 3) 4: order = 56 4: type = 1275069445You have to lookup
/N/hpc/mpich2/include/mpi.h to see that, e.g.,
-49767 is indeed MPI_DISTRIBUTE_DFLT_DARG and that
121 is indeed MPI_DISTRIBUTE_BLOCK and that
56 is indeed MPI_ORDER_C and that
1275069445 is indeed MPI_INT.
Now the program finally calls
MPI_Type_create_darray
and commits the new returned MPI data type called file_type:
MPI_Type_create_darray(pool_size, my_rank, ndims,
array_of_gsizes, array_of_distribs,
array_of_dargs, array_of_psizes, order,
MPI_INT, &file_type);
MPI_Type_commit(&file_type);
There are two important numbers that characterize this new MPI type,
its extent and its size. The extent of an MPI type
is the distance in bytes between the upper and the lower marker of
the type. The size of an MPI type is the total amount of
non-trivial data contained in the type, also in bytes. In other
words, the size is the extent of the type minus the padding.
We can extract these two numbers from the newly defined type by
calling functions
MPI_Type_extent and
MPI_Type_size :
MPI_Type_extent(file_type, &file_type_extent);
MPI_Type_size(file_type, &file_type_size);
if (verbose) {
printf("%3d: file_type_size = %d\n", my_rank, file_type_size);
printf("%3d: file_type_extent = %d\n", my_rank, file_type_extent);
}
It is instructive to inspect the output of the program in this place:[...] 8: file_type_size = 19767600 8: file_type_extent = 536870912 4: file_type_size = 20000844 4: file_type_extent = 536870912 [...]Observe that the extent of
file_type is 536,870,912 bytes, which is
the size of the file:gustav@bh1 $ ls -l /N/gpfs/gustav/darray total 524288 -rw-r--r-- 1 gustav ucs 536870912 Oct 26 17:49 test gustav@bh1 $but the size of
file_type differs from process to process and
corresponds to the amount of data this process is going to write on
the file, when the view is established. This is how the processes
divide the file amongst themselves. This is done without having to
manipulate processes' local pointers explicitly.
But there is a price to this procedure, which makes it somewhat impractical on systems such as the AVIDD cluster. The price is that the second argument in functionMPI_Type_extentmust be of typeMPI_Aint, andMPI_Aintis defined on the 32-bit systems to be simplyint, because the idea here is that you ought to be able to absorb a data item of this particular type into the memory of your computer. This means that functionMPI_Type_create_darrayas well asMPI_Type_extentwill fail if the total amount of data you want to partition exceedsINT_MAX, i.e., 2,147,483,647 bytes. This, however, is very little data by supercomputer standards. It is in places like this one that the limitations of 32-bit architectures can cause real pain. Luckily, we have a 64-bit system available to us at IU, it is our Research SP. There is also a small component of the AVIDD cluster at IUPUI, which comprises IA64 nodes.
Once data partitioning has been returned to the processes, they can allocate required amount of storage for the data:
write_buffer_size = file_type_size / sizeof(int);
write_buffer = (int*) malloc(write_buffer_size * sizeof(int));
/* We do this in case sizeof(int) does not divide file_type_size
exactly. But this should not happen if we have called
MPI_Type_create_darray with MPI_INT as the original data
type. */
if (! write_buffer) {
sprintf(message, "%3d: malloc write_buffer", my_rank);
perror(message);
MPI_Abort(MPI_COMM_WORLD, errno);
/* We can still abort, because we have not opened any
files yet. Notice that since MPI_Type_create_darray
will fail if SIZE^3 * sizeof(int) exceeds MAX_INT,
because MPI_Aint on AVIDD is a 32-bit integer,
we are rather unlikely to fail on this malloc
anyway. */
}
MPI_Barrier(MPI_COMM_WORLD);
/* We wait here in case some procs have problems with malloc. */
Observe that we are not going to proceed with opening the file until
all processes meet at the barrier, implying that none had problems
with malloc. Now they fill their memory buffers with numbers:
for (i = 0; i < write_buffer_size; i++)
*(write_buffer + i) = my_rank * SIZE + i;
Observe that this will result in each process filling its buffer
with different numbers. We need this in order to check, towards the
end of the program, that data read back from the file is identical
to data that has been written on it in the first place.
Now we open the file and immediately check for a possible problem:
file_open_error = MPI_File_open(MPI_COMM_WORLD, file_name,
MPI_MODE_CREATE | MPI_MODE_WRONLY,
MPI_INFO_NULL, &fh);
if (file_open_error != MPI_SUCCESS) {
MPI_Error_string(file_open_error, error_string,
&error_string_length);
fprintf(stderr, "%3d: %s\n", my_rank, error_string);
MPI_Abort(MPI_COMM_WORLD, file_open_error);
/* It is still OK to abort, because we have failed to
open the file. */
}
If there is a problem on open we can still abort the program
without having to worry about clean-up.
Function MPI_File_open is collective, which means that if
anybody has a problem opening the file, error will be returned to
all of them - and the file won't be opened.
The rest of the program is contains within the else clause
of the if statement:
else {
blah... blah... blah...
} /* no problem with file open */
} /* no problem with input error */
MPI_Finalize();
exit(0);
}
The first thing that happens within this clause is that the master
process changes permissions on the successfully opened file
from rw-rw-rw- to rw-r--r--, while other processes
wait on the barrier:
if (i_am_the_master)
chmod(file_name, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
MPI_Barrier(MPI_COMM_WORLD);
This shouldn't be necessary, because MPI provides means for changing
permissions on a newly created file by the means of info hints,
but MPICH2 hasn't implemented this part of the standard, so we have
to relay on UNIX to do this ourselves.
Now we convert our MPI type file_type into an official view
of the file:
MPI_File_set_view(fh, 0, MPI_INT, file_type, "native", MPI_INFO_NULL);
and write the data. This time we call the collective form
of MPI_File_write, which is
MPI_File_write_all.
It works much like MPI_File_write, but this time
all processes must do it at the same time, and any local errors become
global errors automatically. Consequently, if any of the processes
fails to write the data, all will know about it and all will enter
the following clause of the if statement:
file_write_error =
MPI_File_write_all(fh, write_buffer, write_buffer_size, MPI_INT,
&status);
if (file_write_error != MPI_SUCCESS) {
MPI_Error_string(file_write_error, error_string,
&error_string_length);
fprintf(stderr, "%3d: %s\n", my_rank, error_string);
MPI_File_close(&fh);
free(write_buffer);
if (i_am_the_master) MPI_File_delete(file_name, MPI_INFO_NULL);
}
If there is an error, all processes close the file - this is a collective
call too, and then the master process deletes it.
The rest of the program is again in the form of the else statement:
else {
blah... blah... blah...
} /* no problem with file write */
} /* no problem with file open */
} /* no input error */
MPI_Finalize();
exit(0);
}
The first thing that happens here is that every process inspects the
status returned by MPI_File_write_all for the number
of data written on the file (it should be the same as intended, but
this is a yet another way to check it), and the size of the whole
file. These numbers should agree with numbers obtained from the
call to MPI_Type_construct_darray:
MPI_Get_count(&status, MPI_INT, &count);
MPI_File_get_size(fh, &file_size);
if(verbose) {
printf("%3d: wrote %d integers\n", my_rank, count);
printf("%3d: file size is %lld bytes\n", my_rank, file_size);
}
and afterwards the file gets closed:
MPI_File_close(&fh);
The last part of the program opens the file again, but this time for reading. We allocate space for the reading buffer first, check if there are no problems with malloc and then open the file for reading only:
read_buffer_size = write_buffer_size;
read_buffer = (int*) malloc(read_buffer_size * sizeof(int));
if (! read_buffer) {
sprintf(message, "%3d: malloc read_buffer", my_rank);
perror(message);
MPI_Abort(MPI_COMM_WORLD, errno);
/* We can abort, because the file has been closed and
we haven't opened it for reading yet. */
}
MPI_Barrier(MPI_COMM_WORLD);
/* We wait here in case some procs have problems with malloc. */
MPI_File_open(MPI_COMM_WORLD, file_name, MPI_MODE_RDONLY,
MPI_INFO_NULL, &fh);
Next we establish the same view of the file as before, read all
data back from it, every process checks how much data has really
been read by inspecting the status, and then the file gets
closed again:
MPI_File_set_view(fh, 0, MPI_INT, file_type, "native", MPI_INFO_NULL);
MPI_File_read_all(fh, read_buffer, read_buffer_size, MPI_INT, &status);
MPI_Get_count(&status, MPI_INT, &count);
if (verbose)
printf("%3d: read %d integers\n", my_rank, count);
MPI_File_close(&fh);
Now every process compares its read_buffer with its
write_buffer, which
we have wisely saved, and the result of the comparison is
reduced on the variable read_error maintained by the
master process:
for (i = 0; i < read_buffer_size; i++) {
if (*(write_buffer + i) != *(read_buffer + i)) {
printf("%3d: data read different from data written, i = %d\n",
my_rank, i);
my_read_error = TRUE;
}
}
MPI_Reduce (&my_read_error, &read_error, 1, MPI_INT, MPI_LOR,
MASTER_RANK, MPI_COMM_WORLD);
If the read and write buffers contain identical data for all processes
in the pool, the master process prints the message about it:
if (i_am_the_master)
if (! read_error)
printf("--> All data read back is correct.\n");
And this is where the action ends. All processes then meet at
MPI_Finalize and exit.