
Defining File Views in Terms of MPI Datatypes

How are we going to use data types in order to tell processes how to partition a file?

This is how. Suppose the picture below represents a file. Each little square corresponds to a data item of some elementary type, and this type may be quite complex. It is elementary not because it is simple, but because this is what the file is made of.


[ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ][ ]
Suppose we have 4 processes in the pool. Let us define the filetype for the process of rank 0 to be:

[0][ ][ ][ ]
where a 0 in the square means that this particular data item is considered full, and the remaining three squares without a 0 in them are empty. As far as process zero is concerned, they are just padding.

Now, let the filetype for the process of rank 1 be:


[ ][1][ ][ ]
where a 1 in the square means that this particular data item is considered full, and the remaining three squares without a 1 in them are empty. As far as process 1 is concerned, they are just padding.

Similarly, the filetype for the process of rank 2 is:


[ ][ ][2][ ]
and for the process of rank 3:

[ ][ ][ ][3]

Now the process of rank 0 is going to call the function MPI_File_set_view to establish its view of the file as follows:


MPI_File_set_view(fh, 0, [ ], [0][ ][ ][ ], "native", MPI_INFO_NULL)
where the single empty square stands for the elementary type and the row of four squares for the filetype of process 0. Let me skip the detailed discussion of this function for the time being, and just show how the remaining processes are going to call it:

MPI_File_set_view(fh, 0, [ ], [ ][1][ ][ ], "native", MPI_INFO_NULL)   (process 1)
MPI_File_set_view(fh, 0, [ ], [ ][ ][2][ ], "native", MPI_INFO_NULL)   (process 2)
MPI_File_set_view(fh, 0, [ ], [ ][ ][ ][3], "native", MPI_INFO_NULL)   (process 3)
As you can see, the only argument that is unique to each process is the filetype.

Once all processes have issued the calls, this is how the file is going to be partitioned:


[0][1][2][3][0][1][2][3][0][1][2][3][0][1][2][3][0][1][2][3]
When process 1, say, issues MPI_File_read, it is going to read its own items only, i.e., the ones labeled with 1. Its own file pointer will be automatically advanced to the required location in the file.

So this is how the file gets partitioned without us having to specify separate file offsets for each process explicitly. But constructing such different file views for each process may not be all that easy either. Luckily, MPI-2 provides us with a very powerful function, MPI_Type_create_darray, that can generate process-dependent file views automatically. But before I explain how this function works, let me go back to MPI_File_set_view and explain in more detail the meaning of its various arguments, as well as the behaviour of the function itself.

MPI_File_set_view is a collective function. All processes that have opened the file have to participate in this call. The file handle and the data representation string must be identical for all processes. The extent of the elementary type, i.e., the distance in bytes between its lower and upper bounds, must be the same for all processes. But the processes may call this function with different displacements, filetypes and info objects. Note that apart from differentiating the view with a process-specific filetype, you may use different initial displacements too.
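To make this concrete, here is a minimal sketch of how each of our four processes could build such an interleaved filetype by hand and then set its view. The use of MPI_Type_create_subarray to construct the filetype, the choice of MPI_INT as the elementary type, and the file name testfile are merely illustrative assumptions, not something prescribed above.

#include <mpi.h>

int main(int argc, char *argv[])
{
   int rank, size, i, buffer[100];
   int gsizes[1], subsizes[1], starts[1];
   MPI_File fh;
   MPI_Datatype filetype;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);

   /* A filetype that spans "size" elementary items but fills only the
      slot whose position equals this process's rank: one full square,
      the rest padding, exactly as in the pictures above. */
   gsizes[0] = size; subsizes[0] = 1; starts[0] = rank;
   MPI_Type_create_subarray(1, gsizes, subsizes, starts,
                            MPI_ORDER_C, MPI_INT, &filetype);
   MPI_Type_commit(&filetype);

   MPI_File_open(MPI_COMM_WORLD, "testfile",   /* illustrative file name */
                 MPI_MODE_CREATE | MPI_MODE_RDWR, MPI_INFO_NULL, &fh);

   /* The call is the same on every process; only the filetype differs. */
   MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

   /* Each process writes 100 integers into its own interleaved slots. */
   for (i = 0; i < 100; i++) buffer[i] = rank;
   MPI_File_write(fh, buffer, 100, MPI_INT, MPI_STATUS_IGNORE);

   MPI_File_close(&fh);
   MPI_Type_free(&filetype);
   MPI_Finalize();
   return 0;
}

Run with 4 processes, this reproduces the interleaved partitioning shown above.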

The data representation string specifies how the data that is passed to MPI_File_write is going to be stored in the file itself. The simplest way to write a file, especially under UNIX, is to copy the bytes from memory to the disk without any further processing. But under other operating systems files may have fancy structures, multiple forks, format records and what not. Even under UNIX, Fortran files differ from plain C-language files, because Fortran files may have record markers embedded in them.

MPI defines three data representations and MPI implementations are free to add more. The three basic representations are:

"native"
Data in this representation is stored in a file exactly as it is in memory. This format is fine for homogeneous farms, but it may fail for heterogeneous farms, because of problems with big-endian versus little-endian conversions and other data representation incompatibilities.
"internal"
Data in this representation is written in a format that is specific to the MPI implementation but independent of the operating system and the architecture. The MPI implementation will perform the required conversions transparently when data is transferred between computational nodes, possibly of various architectures, and the file.
"external32"
Data in this representation is written in the "big-endian IEEE" format, which is also operating system and architecture independent. Writing data in this format lets you take the file away from the MPI environment it was written in, for processing on other machines, under other operating systems and on other architectures.
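For instance, had we wanted the file written in the sketch above to be readable outside the MPI environment, only the data representation argument of the view would need to change; the call below is merely an illustration of that point:

/* Same etype and filetype as before; only the representation string
   changes, so the data lands in the portable "external32" format. */
MPI_File_set_view(fh, 0, MPI_INT, filetype, "external32", MPI_INFO_NULL);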

When the file gets opened with MPI_File_open, you get the default view, which is equivalent to the call:

MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);

Now let us get to MPI_Type_create_darray, the function that is going to make our task of defining process-dependent file views easier.

This function does a lot of very hard work and, at the same time, it is going to save the programmer a lot of very hard work too, but for this very reason it is a little complicated. Its synopsis is as follows:

int MPI_Type_create_darray(
       int size,
       int rank,
       int ndims,
       int array_of_gsizes[],
       int array_of_distribs[],
       int array_of_dargs[],
       int array_of_psizes[],
       int order,
       MPI_Datatype oldtype,
       MPI_Datatype *newtype)
When called, it is going to generate a datatype corresponding to the distribution of an ndims-dimensional array of oldtype elements onto an ndims-dimensional grid of logical processes.

Remember how we had a 2-dimensional grid of processes in section 5.2.5, which talked about solving a diffusion problem. There we also had a 2-dimensional array of integers, which we distributed manually amongst the processes of the 2-dimensional grid, so that each process got a small portion of it and then worked on it, updating its edges by getting values from its neighbours. Function MPI_Type_create_darray is going to produce such a partitioning for us automatically.

The parameters of the function have the following meaning:

size
this is the size of the process group;
rank
this is the rank of the process in the group;
ndims
this is the number of dimensions of the process grid - here we assume that we have created a Cartesian grid of processes - and it is also the number of dimensions of the array that is going to be distributed amongst the processes, e.g., if we have a 3-dimensional array, we need a 3-dimensional grid of processes to distribute the array over;
array_of_gsizes
this is a one-dimensional array of positive integers of length ndims; each entry gives the number of elements of type oldtype in the corresponding dimension of the global array;
array_of_distribs
this is a one-dimensional array of predefined MPI constants that specifies how the corresponding dimension of the array should be distributed; the three constants provided are MPI_DISTRIBUTE_BLOCK - which requests block distribution along the corresponding dimension, MPI_DISTRIBUTE_CYCLIC - which requests cyclic distribution along the corresponding dimension, and MPI_DISTRIBUTE_NONE - which requests no distribution along the corresponding dimension;
array_of_dargs
this is a one-dimensional array of positive integers of length ndims; each entry is an argument that further specifies how the distribution along the corresponding dimension should be done - there is one MPI constant provided here, MPI_DISTRIBUTE_DFLT_DARG, which lets MPI do the default distribution characterized only by array_of_distribs;
array_of_psizes
this is a one-dimensional array of positive integers of length ndims; each entry gives the number of processes in the corresponding dimension of the process grid;
order
this is the storage order flag: arrays may be stored either FORTRAN-style, i.e., column-major, or C-style, i.e., row-major. There are two predefined MPI constants that may be used here: MPI_ORDER_FORTRAN and MPI_ORDER_C.
oldtype
this is the MPI type of a single item in the array - because it is an array, every item in it is, of course, of the same type - but the type doesn't have to be basic; it may be quite a complex structure;
newtype
this is the new MPI data type that is going to be specific to each process and that can be used in the call to MPI_File_set_view.
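To see how these parameters fit together before the complete program that follows, here is a minimal sketch in which an 8 by 10 global array of ints is distributed block-wise over a 2 by 2 process grid; the sizes, the distribution choices and the file name testfile are merely illustrative assumptions.

#include <mpi.h>

int main(int argc, char *argv[])
{
   int rank, size, i;
   int gsizes[2]   = { 8, 10 };   /* global array: 8 rows by 10 columns */
   int distribs[2] = { MPI_DISTRIBUTE_BLOCK, MPI_DISTRIBUTE_BLOCK };
   int dargs[2]    = { MPI_DISTRIBUTE_DFLT_DARG, MPI_DISTRIBUTE_DFLT_DARG };
   int psizes[2]   = { 2, 2 };    /* 2 by 2 grid: run with 4 processes */
   int local[4 * 5];              /* each process owns a 4 by 5 block */
   MPI_Datatype filetype;
   MPI_File fh;

   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &size);

   /* Generate this process's view of the distributed array. */
   MPI_Type_create_darray(size, rank, 2, gsizes, distribs, dargs, psizes,
                          MPI_ORDER_C, MPI_INT, &filetype);
   MPI_Type_commit(&filetype);

   MPI_File_open(MPI_COMM_WORLD, "testfile",   /* illustrative file name */
                 MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
   MPI_File_set_view(fh, 0, MPI_INT, filetype, "native", MPI_INFO_NULL);

   /* Each process writes its own 4 by 5 block of the global array. */
   for (i = 0; i < 4 * 5; i++) local[i] = rank;
   MPI_File_write_all(fh, local, 4 * 5, MPI_INT, MPI_STATUS_IGNORE);

   MPI_File_close(&fh);
   MPI_Type_free(&filetype);
   MPI_Finalize();
   return 0;
}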

At this stage I feel that you need a programming example to make sense of all this. So here it is.

