How are we going to use data types in order to tell processes how to partition a file?
This is how. Suppose the picture below represents a file. Each little square corresponds to a data item of some elementary type, and this type may be quite complex. It is elementary not because it is simple, but because this is what the file is made of.
Now, let the filetype for process of rank 1 be:
Similarly the filetype for process of rank 2 is:
Now process of rank 0 is going to call function MPI_File_set_view to establish its view of the file as follows:
MPI_File_set_view(fh, disp, etype, filetype, "native", MPI_INFO_NULL);
where
fh is the file handle;
disp is the displacement, in bytes from the beginning of the file,
of the place where this file view begins (a file may have different
views associated with it in various places);
etype is the elementary data type;
filetype is the file data type, which must be defined in terms of
elementary data types;
datarep is a string that defines the data representation,
in this case it is "native";
info is the info structure, in this case it is MPI_INFO_NULL.
Once all processes have issued the calls this is how the file is going to be partitioned:
When the process of rank 1 then calls MPI_File_read, it is going to read
its own items only, i.e., the ones labeled with 1.
Its own file pointer will be automatically advanced to the required
location in the file.
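The effect of such a view on addressing can be sketched in plain C, without any MPI calls. The sketch assumes a hypothetical round-robin layout, in which each of nprocs processes selects one elementary item out of every nprocs consecutive ones, and the process of rank r selects the item in slot r. The function name view_item_offset and this particular layout are illustrative assumptions, not something fixed by MPI.

```c
#include <assert.h>

/*
 * Byte offset in the file of the k-th item (k = 0, 1, ...) that a process
 * sees through its view.
 * ASSUMPTION: a round-robin file type, where one tile covers nprocs
 * elementary items and the process of rank r owns slot r of each tile.
 */
long view_item_offset(long disp, long etype_size, int nprocs, int rank, long k)
{
    assert(etype_size > 0 && nprocs > 0);
    assert(rank >= 0 && rank < nprocs && k >= 0);

    /* Each repetition (tile) of the file type spans nprocs elementary items. */
    long tile_extent = (long)nprocs * etype_size;

    /* Skip the displacement, then k whole tiles, then the slots of lower ranks. */
    return disp + k * tile_extent + (long)rank * etype_size;
}
```

With three processes, 4-byte items and a zero displacement, the process of rank 1 would find its first three items at byte offsets 4, 16 and 28. This is the arithmetic that the view performs on the process's behalf every time its file pointer advances.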
So this is how the file gets partitioned without us having to specify
separate file offsets for each process explicitly. But constructing
a different file view for each process may not be all that easy
either. Luckily, MPI-2 provides us with a very powerful function,
MPI_Type_create_darray, that can generate the process-dependent
data types needed for such file views automatically.
But before I get to explain how this function works, let me go
back to MPI_File_set_view and explain in more detail the meaning of
its various arguments, as well as the behaviour of the function
itself.
MPI_File_set_view is a collective function. All processes
that have opened the file have to participate in this call.
The file handle and the data representation strings must be identical
for all processes. The extent of the elementary type, i.e.,
the distance between
its upper and its lower marker in bytes, must be
the same for all processes. But the processes may call this function
with different displacements, file types and infos. Note that
apart from differentiating the view with a process specific file
type, you may use different initial displacements too.
The data representation string specifies how the data that is passed
to MPI_File_write is going to be stored on the file itself.
The simplest way to write a file, especially under UNIX, is
to copy the bytes from memory to the disk without any further
processing. But under other operating systems files may have fancy
structures, multiple
forks, format records and what not. Even under UNIX Fortran files differ
from plain C-language files, because Fortran files may have record markers
embedded in them.
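To see why the representation matters, consider a 32-bit integer. A plain copy from memory stores its bytes in whatever order the machine happens to use, whereas a portable representation fixes the order. The external32 representation defined by the MPI standard, for instance, is big-endian. The helper below is a hypothetical illustration of such a fixed byte order, written in plain C. It is not part of MPI, and an MPI library would do this kind of conversion internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not an MPI function): store the 32-bit value v
 * into out[0..3] most significant byte first, i.e., in big-endian order,
 * regardless of the byte order used by the machine running the code.
 */
void put_u32_big_endian(uint32_t v, unsigned char out[4])
{
    assert(out != NULL);
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}
```

Calling put_u32_big_endian(0x01020304, buf) leaves the bytes 1, 2, 3, 4 in buf on every machine, whereas a direct memory copy on a little-endian machine would leave them as 4, 3, 2, 1.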
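MPI defines three data representations and MPI implementations are free to add more. The three basic representations are:
"native" - data is stored in the file exactly as it is in memory, without any conversion; this is the fastest representation, but such files may not be readable on a machine of a different architecture;
"internal" - data is stored in an implementation-defined format that can be read by the same MPI implementation on any machine it supports;
"external32" - data is stored in a format defined by the MPI standard (big-endian, with IEEE floating point numbers), so that such files can be read by any MPI implementation on any machine.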
When the file gets opened with MPI_File_open, you get
the default view, which is equivalent to the call:
MPI_File_set_view(fh, 0, MPI_BYTE, MPI_BYTE, "native", MPI_INFO_NULL);
Now let us get to MPI_Type_create_darray, the function that is going to make our task of defining process dependent file views easier.
This function does a lot of hard work that the programmer would otherwise have to do by hand, and for this very reason it is a little complicated. Its synopsis is as follows:
int MPI_Type_create_darray(
    int size,
    int rank,
    int ndims,
    int array_of_gsizes[],
    int array_of_distribs[],
    int array_of_dargs[],
    int array_of_psizes[],
    int order,
    MPI_Datatype oldtype,
    MPI_Datatype *newtype);
When called it is going to generate the datatypes corresponding
to the distribution of an ndims-dimensional array
of oldtype elements onto an ndims-dimensional
grid of logical processes.
Remember how we had a 2-dimensional grid of processes in
section 5.2.5 that talked about solving a diffusion problem.
There we also had a 2-dimensional array of integers,
which we distributed manually
amongst the processes of the 2-dimensional grid, so that each process got
a small portion of it and then worked on it, updating its edges by getting
values from its neighbours. Function MPI_Type_create_darray is
going to perform such partitioning for us automatically.
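The parameters of the function have the following meaning:
size is the number of processes in the group over which the array
is distributed;
rank is the rank of the calling process within that group;
ndims is the number of dimensions of the global array, which is also
the number of dimensions of the process grid;
array_of_gsizes is an array of length ndims; each entry in the array
tells us about the number of elements of type
oldtype in the corresponding dimension
of the global array;
array_of_distribs is an array of length ndims; each entry in the array
can be one of
MPI_DISTRIBUTE_BLOCK, which
requests block distribution along the corresponding dimension,
MPI_DISTRIBUTE_CYCLIC, which
requests cyclic distribution along the corresponding dimension,
and MPI_DISTRIBUTE_NONE, which
requests no distribution along the corresponding dimension;
array_of_dargs is an array of length ndims; each entry in the array is the
argument that further specifies how the distribution of the
array should be done - there is one MPI constant provided
here, MPI_DISTRIBUTE_DFLT_DARG,
which lets MPI do default distribution characterized only by
array_of_distribs;
array_of_psizes is an array of length ndims; each entry in the array
tells us about the number of processes
in the corresponding dimension of the process grid;
order specifies the storage order of the global array and can be
either MPI_ORDER_FORTRAN
or MPI_ORDER_C;
oldtype is the data type of the elements of the global array;
newtype is the new data type returned by the function, which the
calling process can then use as its file type in MPI_File_set_view.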
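Before the full program, the arithmetic behind the distribution arguments can be sketched in plain C. The three functions below are hypothetical helpers, not MPI calls. They assume what the MPI standard describes for the default distribution argument: processes are laid out on the grid in row-major order, a block distribution uses blocks of ceil(gsize/psize) elements, and a cyclic distribution deals elements out one at a time.

```c
#include <assert.h>

/* Coordinates of a process on the grid, assuming row-major rank ordering. */
void rank_to_coords(int rank, int ndims, const int psizes[], int coords[])
{
    int remaining = 1;
    for (int i = 0; i < ndims; i++) {
        assert(psizes[i] > 0);
        remaining *= psizes[i];          /* total number of processes */
    }
    assert(rank >= 0 && rank < remaining);
    for (int i = 0; i < ndims; i++) {
        remaining /= psizes[i];          /* processes per step in dimension i */
        coords[i] = rank / remaining;
        rank %= remaining;
    }
}

/*
 * Block distribution along one dimension with the default block size.
 * Returns how many elements the process at grid coordinate coord owns
 * and stores the index of its first element in *start.
 */
int block_range(int gsize, int psize, int coord, int *start)
{
    assert(gsize > 0 && psize > 0 && coord >= 0 && coord < psize);
    int block = (gsize + psize - 1) / psize;   /* ceil(gsize / psize) */
    int first = coord * block;
    int count = gsize - first;                 /* elements left from first on */
    if (count > block) count = block;
    if (count < 0) count = 0;                  /* trailing processes may own nothing */
    *start = first;
    return count;
}

/* Cyclic distribution with the default block size of one element. */
int cyclic_owner(int index, int psize)
{
    assert(index >= 0 && psize > 0);
    return index % psize;
}
```

On a 2 x 3 process grid, for example, the process of rank 5 sits at coordinates (1, 2). If 10 elements are block-distributed over 4 processes, the first process owns 3 elements starting at index 0 and the last one owns a single element at index 9. Under a cyclic distribution over 3 processes, element 7 belongs to the process at coordinate 1.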
At this stage I feel that you need a programming example to make sense of all this. So here it is.