
Striding Data

In this section we are going to use a stride and a block while selecting the hyperslab that partitions the dataset in the file. The effect will be that the two processes that constitute the MPI pool in the example that follows will divide the HDF5 dataset into interleaved columns, i.e., the process of rank 0 is going to write column 1 while the process of rank 1 writes column 2, then rank 0 writes column 3 while rank 1 writes column 4, and so on. The result will be a dataset that looks as follows:

\begin{displaymath}
\begin{array}{cccccc}
1 & 2 & 10 & 20 & 100 & 200 \\
1 & 2 & 10 & 20 & 100 & 200 \\
1 & 2 & 10 & 20 & 100 & 200 \\
1 & 2 & 10 & 20 & 100 & 200 \\
1 & 2 & 10 & 20 & 100 & 200 \\
1 & 2 & 10 & 20 & 100 & 200 \\
1 & 2 & 10 & 20 & 100 & 200 \\
1 & 2 & 10 & 20 & 100 & 200
\end{array}
\end{displaymath}

where the first digit of every number identifies the process that has written it: numbers beginning with 1 (1, 10, 100) were written by the process of rank 0 and numbers beginning with 2 (2, 20, 200) by the process of rank 1.

The program must be run on two processes.

Here is how to compile and run it:

gustav@bh1 $ pwd
/N/B/gustav/src/I590/HDF5/mpi-io
gustav@bh1 $ h5cc -o hyperslab_by_col hyperslab_by_col.c 
gustav@bh1 $ mv hyperslab_by_col ~/bin
gustav@bh1 $ mpdboot
gustav@bh1 $ mpdtrace | wc
     20      20      99
gustav@bh1 $ cd /N/gpfs/gustav
gustav@bh1 $ ls
SDS_row.h5  t_mpio_1wMr  test  tests
gustav@bh1 $ mpiexec -n 2 hyperslab_by_col
gustav@bh1 $ ls
SDS_col.h5  SDS_row.h5  t_mpio_1wMr  test  tests
gustav@bh1 $ h5dump SDS_col.h5
HDF5 "SDS_col.h5" {
GROUP "/" {
   DATASET "IntArray" {
      DATATYPE  H5T_STD_I32LE
      DATASPACE  SIMPLE { ( 8, 6 ) / ( 8, 6 ) }
      DATA {
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200,
         1, 2, 10, 20, 100, 200
      }
   }
}
}
gustav@bh1 $ mpdallexit
gustav@bh1 $

Now, here is the listing of the program, taken from the NCSA HDF5 Tutorial; the discussion of the code follows further down.

/*  
 *  This example writes data to the HDF5 file by columns.
 *  Number of processes is assumed to be 2.
 */
 
#include "hdf5.h"

#define H5FILE_NAME     "SDS_col.h5"
#define DATASETNAME 	"IntArray" 
#define NX     8                      /* dataset dimensions */
#define NY     6 
#define RANK   2

int
main (int argc, char **argv)
{
    /*
     * HDF5 APIs definitions
     */ 	
    hid_t       file_id, dset_id;         /* file and dataset identifiers */
    hid_t       filespace, memspace;      /* file and memory dataspace identifiers */
    hsize_t     dimsf[2];                 /* dataset dimensions */
    hsize_t     dimsm[2];                 /* dataset dimensions */
    int         *data;                    /* pointer to data buffer to write */
    hsize_t	count[2];	          /* hyperslab selection parameters */
    hsize_t	stride[2];
    hsize_t	block[2];
    hssize_t	offset[2];
    hid_t	plist_id;                 /* property list identifier */
    int         i, j, k;
    herr_t	status;

    /*
     * MPI variables
     */
    int mpi_size, mpi_rank;
    MPI_Comm comm  = MPI_COMM_WORLD;
    MPI_Info info  = MPI_INFO_NULL;

    /*
     * Initialize MPI
     */
    MPI_Init(&argc, &argv);
    MPI_Comm_size(comm, &mpi_size);
    MPI_Comm_rank(comm, &mpi_rank);  
    /*
     * Exit if number of processes is not 2
     */
    if (mpi_size != 2) {
       printf("This example to set up to use only 2 processes \n");
       printf("Quitting...\n");
       return 0;
    }
 
    /* 
     * Set up file access property list with parallel I/O access
     */
     plist_id = H5Pcreate(H5P_FILE_ACCESS);
     H5Pset_fapl_mpio(plist_id, comm, info);

    /*
     * Create a new file collectively and release property list identifier.
     */
    file_id = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);
    H5Pclose(plist_id);
   

    /*
     * Create the dataspace for the dataset.
     */
    dimsf[0] = NX;
    dimsf[1] = NY;
    dimsm[0] = NX;
    dimsm[1] = NY/2;
    filespace = H5Screate_simple(RANK, dimsf, NULL); 
    memspace  = H5Screate_simple(RANK, dimsm, NULL); 

    /*
     * Create the dataset with default properties and close filespace.
     */
    dset_id = H5Dcreate(file_id, DATASETNAME, H5T_NATIVE_INT, filespace,
			H5P_DEFAULT);
    H5Sclose(filespace);

    /* 
     * Each process defines dataset in memory and writes it to the hyperslab
     * in the file.
     */
    count[0] = 1;
    count[1] = dimsm[1];
    offset[0] = 0;
    offset[1] = mpi_rank;
    stride[0] = 1;
    stride[1] = 2;
    block[0] = dimsf[0];
    block[1] = 1;

    /*
     * Select hyperslab in the file.
     */
    filespace = H5Dget_space(dset_id);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, stride, count, block);

    /*
     * Initialize data buffer 
     */
    data = (int *) malloc(sizeof(int)*dimsm[0]*dimsm[1]);
    for (i=0; i < dimsm[0]*dimsm[1]; i=i+dimsm[1]) {
        k = 1;
        for (j=0; j < dimsm[1]; j++) {
            data[i + j] = (mpi_rank + 1) * k;
            k = k * 10;
        }
    }

    /*
     * Create property list for collective dataset write.
     */
    plist_id = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);
    
    status = H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace,
		      plist_id, data);
    free(data);

    /*
     * Close/release resources.
     */
    H5Dclose(dset_id);
    H5Sclose(filespace);
    H5Sclose(memspace);
    H5Pclose(plist_id);
    H5Fclose(file_id);
 
    MPI_Finalize();

    return 0;
}

This MPI/HDF5 code begins the same way the previous code did: we initialize MPI, find out the size of the communicator, and every process then finds its rank. But then we check whether the program is running on a number of processes other than 2 and, if it is, we quit:

    if (mpi_size != 2) {
       printf("This example to set up to use only 2 processes \n");
       printf("Quitting...\n");
       return 0;
    }
Observe the dirty, UNIX-like exit through "return" instead of MPI_Abort, or a jump to MPI_Finalize at the end of the program. This is not nice, but, clearly, the authors didn't care. Such an exit is bound to trigger MPD error messages. But... let's plough on.
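
If one wanted to tidy this up, a minimal sketch (our addition, not part of the NCSA tutorial code) could finalize MPI before returning, or abort the whole pool explicitly:

    if (mpi_size != 2) {
       if (mpi_rank == 0)
          printf("This example is set up to use only 2 processes\n");
       MPI_Finalize();      /* shut MPI down cleanly before leaving ... */
       return 0;
       /* ... or, to kill the whole run outright, call instead:
          MPI_Abort(MPI_COMM_WORLD, 1);                                 */
    }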

After this we create a property list for the MPI file access:

     plist_id = H5Pcreate(H5P_FILE_ACCESS);
     H5Pset_fapl_mpio(plist_id, comm, info);
and then we create the file itself, whereupon the property list gets closed, because it is no longer needed:
#define H5FILE_NAME     "SDS_col.h5"
...
    file_id = H5Fcreate(H5FILE_NAME, H5F_ACC_TRUNC, H5P_DEFAULT, plist_id);
    H5Pclose(plist_id);
Now we are going to create two data spaces: one for the dataset in the file and one for memory. Because we are going to split the dataset between the two processes of the pool, the memory data space of each process is going to be half of the dataset data space:
#define NX     8
#define NY     6 
#define RANK   2
...
    hsize_t     dimsf[2];
    hsize_t     dimsm[2];
...
    dimsf[0] = NX;
    dimsf[1] = NY;
    dimsm[0] = NX;
    dimsm[1] = NY/2;
    filespace = H5Screate_simple(RANK, dimsf, NULL); 
    memspace  = H5Screate_simple(RANK, dimsm, NULL);
Here NX is the number of rows and NY is the number of columns.

Now we can create the dataset and, having done so, close the filespace, because it is no longer needed:

#define DATASETNAME 	"IntArray" 
...
    dset_id = H5Dcreate(file_id, DATASETNAME, H5T_NATIVE_INT, filespace,
			H5P_DEFAULT);
    H5Sclose(filespace);

At this stage, in preparation for writing, we are going to define our hyperslabs - these are going to be different for each process. The procedure is as follows: first we obtain the whole dataspace by interrogating the dataset and then we modify the dataspace by calling H5Sselect_hyperslab with the H5S_SELECT_SET argument. Here is the whole call:

    count[0] = 1;
    count[1] = dimsm[1];  /* i.e., count[1] = NY/2 = 3 */
    offset[0] = 0;
    offset[1] = mpi_rank; /* i.e., offset[1] = 0 or 1 */
    stride[0] = 1;
    stride[1] = 2;
    block[0] = dimsf[0];  /* i.e., block[0] = NX = 8 */
    block[1] = 1;

    filespace = H5Dget_space(dset_id);
    H5Sselect_hyperslab(filespace, H5S_SELECT_SET, offset, stride, count, block);
Let us read this definition aloud.

Consider the process of rank 0. For this process offset[0] = 0 and offset[1] = 0 too. So for this process the hyperslab begins at the (0,0) location in the dataset, which corresponds to the beginning of the first column. But for the process of rank 1 offset[0] = 0 and offset[1] = 1, so for this process the hyperslab begins at the (0,1) location in the dataset, which corresponds to the beginning of the second column.

For both processes count[0] = 1 and stride[0] = 1 - this corresponds to rows - while count[1] = 3 and stride[1] = 2 - this corresponds to columns. This means that each process is going to select every second block, moving to the right, three times over, and each block is going to be 8 rows by 1 column, because block[0] = 8 and block[1] = 1.

In other words, process 0 will select columns 1, 3, and 5 and process 1 will select columns 2, 4, and 6.
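
This is just the hyperslab arithmetic spelled out: counting columns from zero, the process of rank $r$ selects the columns

\begin{displaymath}
\texttt{offset[1]} + i \cdot \texttt{stride[1]} = r + 2i,
\qquad i = 0, 1, \ldots, \texttt{count[1]} - 1 = 2,
\end{displaymath}

that is, columns 0, 2 and 4 for rank 0 and columns 1, 3 and 5 for rank 1, which are columns 1, 3, 5 and 2, 4, 6 when counted from one, as above.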

Now each process writes the following data to its memory space:

    data = (int *) malloc(sizeof(int)*dimsm[0]*dimsm[1]);
    for (i=0; i < dimsm[0]*dimsm[1]; i=i+dimsm[1]) {
        k = 1;
        for (j=0; j < dimsm[1]; j++) {
           data[i + j] = (mpi_rank + 1) * k ;
           k = k * 10;
        } 
    }
Observe that only the k index is used in constructing the data, whereas the i and j indices are used in placing it. Also observe that k is reset to 1 at the start of every pass through the inner loop, i.e., every time j has run from 0 to dimsm[1] - 1 = 2. And so, for the process of rank 0 we are going to have:
data[] = 1, 10, 100, 1, 10, 100, ... 1, 10, 100
and for the process of rank 1 we are going to have:
data[] = 2, 20, 200, 2, 20, 200, ... 2, 20, 200
Now remember that in C matrices are stored in row-major, i.e., row-contiguous, fashion, which means that when this data is mapped onto our memory space, which is an $8 \times 3$ matrix, we'll get:
1, 10, 100,
1, 10, 100,
...
1, 10, 100
and
2, 20, 200,
2, 20, 200,
...
2, 20, 200
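
If you want to convince yourself of this layout, here is a small standalone C sketch (our addition, independent of HDF5 and MPI) that fills an 8 by 3 buffer the same way for a chosen rank and prints it row by row:

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int nx = 8, ny = 3, mpi_rank = 0;   /* set mpi_rank to 0 or 1 by hand */
    int *data = (int *) malloc(sizeof(int) * nx * ny);
    int i, j, k;

    /* same fill pattern as in the tutorial program */
    for (i = 0; i < nx * ny; i = i + ny) {
        k = 1;
        for (j = 0; j < ny; j++) {
            data[i + j] = (mpi_rank + 1) * k;
            k = k * 10;
        }
    }
    /* print the buffer as an nx by ny matrix, row by row */
    for (i = 0; i < nx; i++)
        for (j = 0; j < ny; j++)
            printf("%4d%s", data[i * ny + j], j == ny - 1 ? "\n" : ", ");
    free(data);
    return 0;
}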

Finally, we are ready to write the data. We create the property list for the collective MPI-IO write (using MPI_File_write_all) and write the data by interpreting its memory layout according to memspace and its HDF5 file dataset layout according to filespace:

    plist_id = H5Pcreate(H5P_DATASET_XFER);
    H5Pset_dxpl_mpio(plist_id, H5FD_MPIO_COLLECTIVE);
    status = H5Dwrite(dset_id, H5T_NATIVE_INT, memspace, filespace,
		      plist_id, data);
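
The tutorial stores the return value of H5Dwrite in status but never examines it. A minimal check (our addition, not part of the original code) could look like this:

    if (status < 0) {
        printf("Process %d: H5Dwrite failed\n", mpi_rank);
        MPI_Abort(comm, 1);
    }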

And this is it: we free data, then we close the dataset, filespace, memspace, the property list and finally the HDF5 file itself. Then the processes meet at MPI_Finalize and we exit the program with return 0.

