In this section we are going to use a stride and a block while
selecting the hyperslab for partitioning of a fileset. The effect
will be that the two processes that constitute the MPI pool
in the example that follows will divide the HDF5 dataset into
interleaving columns, i.e., process 1 is going to write
column 1 while process 2 will write column 2, then process 1 will
write column 3 while process 2 will write column 4 and so on.
The result will be a dataset that looks as follows:
The program must be run on two processes.
Here is how to compile and run it:
gustav@bh1 $ pwd
/N/B/gustav/src/I590/HDF5/mpi-io
gustav@bh1 $ h5cc -o hyperslab_by_col hyperslab_by_col.c
gustav@bh1 $ mv hyperslab_by_col ~/bin
gustav@bh1 $ mpdboot
gustav@bh1 $ mpdtrace | wc
20 20 99
gustav@bh1 $ cd /N/gpfs/gustav
gustav@bh1 $ ls
SDS_row.h5 t_mpio_1wMr test tests
gustav@bh1 $ mpiexec -n 2 hyperslab_by_col
gustav@bh1 $ ls
SDS_col.h5 SDS_row.h5 t_mpio_1wMr test tests
gustav@bh1 $ h5dump SDS_col.h5
HDF5 "SDS_col.h5" {
GROUP "/" {
DATASET "IntArray" {
DATATYPE H5T_STD_I32LE
DATASPACE SIMPLE { ( 8, 6 ) / ( 8, 6 ) }
DATA {
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200,
1, 2, 10, 20, 100, 200
}
}
}
}
gustav@bh1 $ mpdallexit
gustav@bh1 $
Now, here is the listing of the program, taken from the NCSA HDF5 Tutorial, and you'll find the discussion of the code further down.
/*
* This example writes data to the HDF5 file by columns.
* Number of processes is assumed to be 2.
*/
#include "hdf5.h"
#define H5FILE_NAME "SDS_col.h5"
#define DATASETNAME "IntArray"
#define NX 8 /* dataset dimensions */
#define NY 6
#define RANK 2
int
main (int argc, char **argv)
{
/*
* HDF5 APIs definitions
*/
hid_t file_id, dset_id; /* file and dataset identifiers */
hid_t filespace, memspace; /* file and memory dataspace identifiers */
hsize_t dimsf[2]; /* dataset dimensions */
hsize_t dimsm[2]; /* dataset dimensions */
int *data; /* pointer to data buffer to write */
hsize_t count[2]; /* hyperslab selection parameters */
hsize_t stride[2];
hsize_t block[2];
hssize_t offset[2];
hid_t plist_id; /* property list identifier */
int i, j, k;
herr_t status;
/*
* MPI variables
*/
int mpi_size, mpi_rank;
MPI_Comm comm = MPI_COMM_WORLD;
MPI_Info info = MPI_INFO_NULL;
/*
* Initialize MPI
*/
MPI_Init(&argc, &argv);
MPI_Comm_size(comm, &mpi_size);
MPI_Comm_rank(comm, &mpi_rank);
/*
* Exit if number of processes is not 2
*/
if (mpi_size != 2) {
printf("This example to set up to use only 2 processes \n");
printf("Quitting...\n");
return 0;
}
/*
* Set up file access property list with parallel I/O access
*/
plist_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(plist_id, comm, info);
/*
* Create a new file collectively and release property list identifier.
*/
file_id = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);
H5Pclose(plist_id);
/*
* Create the dataspace for the dataset.
*/
dimsf[0] = NX;
dimsf[1] = NY;
dimsm[0] = NX;
dimsm[1] = NY/2;
filespace = H5Screate_simple(RANK, dimsf, NULL);
memspace = H5Screate_simple(RANK, dimsm, NULL);
/*
* Create the dataset with default properties and close filespace.
*/
dset_id = H5Dcreate(file_id, DATASETNAME, H5T_NATIVE_INT, filespace,
H5P_DEFAULT);
H5Sclose(filespace);
/*
* Each process defines dataset in memory and writes it to the hyperslab
* in the file.
*/
count[0] = 1;
count[1] = dimsm[1];
offset[0] = 0;
offset[1] = mpi_rank;
stride[0] = 1;
stride[1] = 2;
block[0] = dimsf[0];
block[1] = 1;
/*
* Select hyperslab in the file.
*/
filespace = H5Dget_space(dset_id);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, stride, count, block);
/*
* Initialize data buffer
*/
data = (int *) malloc(sizeof(int)*dimsm[0]*dimsm[1]);
for (i=0; i < dimsm[0]*dimsm[1]; i=i+dimsm[1]) {
k = 1;
for (j=0; j < dimsm[1]; j++) {
data[i + j] = (mpi_rank +1) * k ;
k = k * 10;
}
}
/*
* Create property list for collective dataset write.
*/
plist_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);
status = H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace,
plist_id, data);
free(data);
/*
* Close/release resources.
*/
H5Dclose(dset_id);
H5Sclose(filespace);
H5Sclose(memspace);
H5Pclose(plist_id);
H5Fclose(file_id);
MPI_Finalize();
return 0;
}
This MPI/HDF5 code begins the same way the previous code has: we initialize MPI, find about the size of the communicator and every process then finds its rank number. But then we check if the program is running on a number of processes that is different than 2 and if it is, we quit:
if (mpi_size != 2) {
printf("This example to set up to use only 2 processes \n");
printf("Quitting...\n");
return 0;
}
Observe the dirty, UNIX-like, exit through "return" instead of
MPI_Abort , or a jump to
MPI_Finalize at the end of the
program. This is not nice, but, clearly, the authors didn't care.
Such an exit is bound to trigger MPD error messages. But After this we create a property list for the MPI file access:
plist_id = H5Pcreate(H5P_FILE_ACCESS);
H5Pset_fapl_mpio(plist_id, comm, info);
and then we create the file itself, whereupon the property list gets
closed, because it is no longer needed:
#define H5FILE_NAME "SDS_col.h5"
...
file_id = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);
H5Pclose(plist_id);
Now we are going to create two data spaces: one for the dataset on the
file and one for
the memory. Because we are going to split the dataset amongst the two
processes of the pool, the memory data space for each process is
going to be half of the dataset data space:
#define NX 8
#define NY 6
#define RANK 2
...
hsize_t dimsf[2];
hsize_t dimsm[2];
...
dimsf[0] = NX;
dimsf[1] = NY;
dimsm[0] = NX;
dimsm[1] = NY/2;
filespace = H5Screate_simple(RANK, dimsf, NULL);
memspace = H5Screate_simple(RANK, dimsm, NULL);
Here NX is the number of rows and NY is the number
of columns.
Now we can create the dataset and having done so, we're going to
close the filespace, because it's no longer needed:
#define DATASETNAME "IntArray"
...
dset_id = H5Dcreate(file_id, DATASETNAME, H5T_NATIVE_INT, filespace,
H5P_DEFAULT);
H5Sclose(filespace);
At this stage, in preparation for writing, we are going to define
our hyperslabs - these are going to be different for each process.
The procedure is as follows: first we obtain the whole dataspace
by interrogating the dataset and then we modify the
dataspace by calling H5Sselect_hyperslab with the
H5S_SELECT_SET argument. Here is the whole call:
count[0] = 1;
count[1] = dimsm[1]; /* i.e., count[1] = NY/2 = 3 */
offset[0] = 0;
offset[1] = mpi_rank; /* i.e., offset[1] = 0 or 1 */
stride[0] = 1;
stride[1] = 2;
block[0] = dimsf[0]; /* i.e., block[0] = NX = 8 */
block[1] = 1;
filespace = H5Dget_space(dset_id);
H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, stride, count, block);
Let us read this definition aloud.
Consider process of rank 0. For this process
offset[0] = 0 and offset[1] = 0 too. So for this process the
hyperslab begins at the (0,0) location in the dataset, which corresponds
to the beginning of the first column. But for
process 1 offset[0] = 0 and offset[1] = 1, so for this process
the hyperslab begins at the (0,1) location in the dataset, which corresponds
to the beginning of the second column.
For both processes count[0] = 1, stride[0] = 1 - this
corresponds to rows -
and count[1] = 3 and stride[1] = 2 - this corresponds to columns -
which means that
each process is going to select every second block of data to the right three times,
and each block is going to be 8 rows times 1 column, because block[0] = 8
and block[1] = 1.
In other words, process 0 will select columns 1, 3, and 5 and process 1 will select columns 2, 4, and 6.
Now each process writes the following data to its memory space:
data = (int *) malloc(sizeof(int)*dimsm[0]*dimsm[1]);
for (i=0; i < dimsm[0]*dimsm[1]; i=i+dimsm[1]) {
k = 1;
for (j=0; j < dimsm[1]; j++) {
data[i + j] = (mpi_rank + 1) * k ;
k = k * 10;
}
}
Observe that only the k index is used in constructing the data, whereas
i and j indexes are used in placing it. Also observe that k
is reset to 1 every time j reaches dimsm[1] = 3. And so,
for the process of rank 0 we are going to have:data[] = 1, 10, 100, 1, 10, 100, ... 1, 10, 100and for the process of rank 1 we are going to have:
data[] = 2, 20, 200, 2, 20, 200, ... 2, 20, 200Now remember that in C matrices are stored in the row-contiguous fashion, which means that when this data is going to be mapped onto our memory space, which is a matrix
1, 10, 100, 1, 10, 100, ... 1, 10, 100and
2, 20, 200, 2, 20, 200, ... 2, 20, 200
Finally, we are ready to write the data. We create the property list for
collective MPI-IO write (unsing MPI_File_write_all ) and write the data by interpreting its memory layout according to memspace
and its HDF5 file dataset layout according to filespace:
plist_id = H5Pcreate(H5P_DATASET_XFER);
H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);
status = H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace,
plist_id, data);
And this is it: we free data, then we close the dataset, filespace,
memspace, the property list and finally the HDF5 file itself.
Then the processes meet at MPI_Finalize and we exit the program
with return 0.