
Writing and Reading MPI Files

Before you can begin writing on an MPI file in parallel, each process participating in the operation must acquire its own view of that file. A view is defined in terms of three parameters: a displacement, which is a location in the file given as the number of bytes from the beginning of the file; an elementary data type, the etype; and a filetype, about which more below.

This business about views, filetypes and etypes is a little hard to understand without an example. Assume that we have some etype such as, e.g., a particle structure. This is a record that comprises a number of doubles, some integers, and some characters. We have seen how to build the corresponding MPI derived data type in one of the previous sections. Now, let us build a new MPI derived data type, which, say, picks up the second and third particles from an array of 6 particles. Symbolically we can write it as follows:

X
OXXOOO
where X in the first row stands for the etype and the second row represents the new derived data type with one hole, O, in front, then two particles, XX, and then 3 holes, OOO.

Let us define three derived MPI types as follows:

type1 = XXOOOO      
type2 = OOXXOO
type3 = OOOOXX

A view, as I have said above, is a triple (displacement, etype, filetype). Define the following three views:

(0, X, XXOOOO)
(0, X, OOXXOO)
(0, X, OOOOXX)
If these are the views that correspond to three different processes, then when a parallel read takes place, the first process will read the first two particles, the second process will read particles 3 and 4, and the third process will read particles 5 and 6. Then the pointer advances to the beginning of the next filetype item and the read operation can continue. The view of the file that the first process has is:
XXOOOO XXOOOO XXOOOO XXOOOO ...
The second process sees the following data on the file:
OOXXOO OOXXOO OOXXOO OOXXOO ...
And the third process' view is:
OOOOXX OOOOXX OOOOXX OOOOXX ...
In this example all processes' views have the same displacement, but the filetypes differ. The same effect can be accomplished by giving three different displacements and sharing the same filetype:
(0, X, XXOOOO)
(sizeof(XX), X, XXOOOO)
(sizeof(XXXX), X, XXOOOO)
In summary: in order to avoid stepping on each other's toes, each process must have a different view of the shared file. If the views are constructed soundly, then each process is going to work on a different portion of data.

So how do you construct a view? Use function:

int MPI_File_set_view (MPI_File fh, MPI_Offset displacement, 
                       MPI_Datatype etype, MPI_Datatype filetype,
                       char *datarep, MPI_Info info);
in C and
MPI_FILE_SET_VIEW(FH, DISP, ETYPE, FILETYPE, DATAREP, INFO, IERROR)
INTEGER FH, ETYPE, FILETYPE, INFO, IERROR
CHARACTER*(*) DATAREP
INTEGER(KIND=MPI_OFFSET_KIND) DISP
in Fortran.
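
To make this concrete, here is a minimal sketch in C of how the three views discussed above could be constructed. The helper function set_particle_view is hypothetical, and particle_type stands for the particle etype built earlier with the MPI derived data type machinery:

#include <mpi.h>

/* A hypothetical helper: each process builds a filetype that picks
   2 particles out of every block of 6 and sets its view accordingly.
   Rank 0 sees XXOOOO, rank 1 sees OOXXOO, rank 2 sees OOOOXX. */
int set_particle_view(MPI_File fh, MPI_Datatype particle_type, int rank)
{
    MPI_Datatype filetype;
    int sizes    = 6;          /* a block of 6 particles          */
    int subsizes = 2;          /* this process picks 2 of them    */
    int starts   = 2 * rank;   /* ... starting at particle 2*rank */

    MPI_Type_create_subarray(1, &sizes, &subsizes, &starts,
                             MPI_ORDER_C, particle_type, &filetype);
    MPI_Type_commit(&filetype);

    /* displacement 0, etype = particle, ``native'' representation */
    return MPI_File_set_view(fh, 0, particle_type, filetype,
                             "native", MPI_INFO_NULL);
}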

There is one parameter in these interfaces that I haven't talked about yet: the data representation parameter, which is a string.

MPI guarantees full interoperability within a single MPI environment, but as yet it offers little support for external data representation. The moment you begin writing MPI files, however, this issue gains in importance, because you are quite likely to process those files on a variety of architectures. The following predefined data representation strings are currently available:

``native''
Data is stored on a file the same way it is stored in memory. This is very fast but non-portable. Use it when writing scratch files.
``internal''
This is a portable data format, which is supported across various platforms by a given MPI implementation, e.g., MPICH. In principle, you may not be able to write data in this format with MPICH and then read it with LAM MPI or with IBM MPI. But you should be able to write data in this format with MPICH on Solaris and then read it, say, on a DEC Alpha.
``external32''
All data is converted to and from the ``external32'' representation defined by the MPI standard. It should work from MPI to MPI and from vendor to vendor. But the conversion may cost precision for types that do not map exactly onto the standard representation, and I/O performance should be expected to suffer.

Once you've set a view on a file, you can also get it back with function

int MPI_File_get_view (MPI_File fh, MPI_Offset *displacement, 
                       MPI_Datatype *etype, MPI_Datatype *filetype,
                       char *datarep);
in C and similarly in Fortran.
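
For example, here is a small sketch that queries the view back, assuming fh is a file opened and set up as above:

MPI_Offset   disp;
MPI_Datatype etype, filetype;
char         datarep[MPI_MAX_DATAREP_STRING];

/* returns the displacement in bytes, the etype and filetype
   handles, and the data representation string, e.g., "native" */
MPI_File_get_view(fh, &disp, &etype, &filetype, datarep);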

So, at this stage all processes should have opened a file and should have defined their view on that file. Now we can begin to write data to the file and to read data from it.

Assuming that you have structured the data on the file with etype and filetype definitions, the simplest way to write data on a file is to call function

int MPI_File_write (MPI_File fh, void *buffer, int count, 
                    MPI_Datatype datatype, MPI_Status *status);
in C and
MPI_FILE_WRITE(FH, BUF, COUNT, DATATYPE, STATUS, IERROR)
<type> BUF(*) 
INTEGER FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR
in Fortran.

This function transfers count data items of type datatype from a buffer pointed to by buffer to file fh. The data will be written at a position in the file pointed to by the file pointer. This operation will advance the pointer according to the formula:

\begin{displaymath}
\mathit{new\_file\_offset} = \mathit{old\_file\_offset}
+ \frac{\mathit{elements}(\mathit{datatype})}{\mathit{elements}(\mathit{etype})}
\times \mathit{count}
\end{displaymath}

If datatype is the same as filetype, which is a sensible thing to do, then the pointer will get advanced, in units of etype, by count filetypes, so that, in effect, the reading of the file will proceed as in the example discussed above.

Once you've written some data on the file you can read it back with

int MPI_File_read(MPI_File fh, void *buf, int count, 
                  MPI_Datatype datatype, MPI_Status *status)
in C and with
MPI_FILE_READ(FH, BUF, COUNT, DATATYPE, STATUS, IERROR)
<type> BUF(*) 
INTEGER FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR
in Fortran.
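
Putting these calls together, here is a minimal sketch of a write followed by a read-back under the views constructed earlier; the particle_t type and the arrays are hypothetical:

particle_t particles[2], check[2];
MPI_Status status;

/* each process writes its 2 particles into its own view ... */
MPI_File_write(fh, particles, 2, particle_type, &status);

/* ... rewinds its individual file pointer ... */
MPI_File_seek(fh, 0, MPI_SEEK_SET);

/* ... and reads the particles back */
MPI_File_read(fh, check, 2, particle_type, &status);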

These two functions, MPI_File_write and MPI_File_read, are blocking and non-collective, i.e., each process does the reads on its own. Each process can read the file differently, in its own way and time. There is no barrier. Some processes may choose to read their data chunks from the file, and some may forgo reading altogether, depending on what they do.

There is a collective version of these calls, which forces all processes in the communicator to read data simultaneously and to wait for each other. These collective functions are called MPI_File_read_all and MPI_File_write_all and their synopsis (though not their semantics) is the same as for the non-collective versions.
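
The calls themselves look the same; the difference is that every process in the communicator that opened the file must participate, as in this sketch:

/* collective: all processes that opened fh must make this call */
MPI_File_write_all(fh, particles, 2, particle_type, &status);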

There are also non-blocking versions of these functions. They are:

int MPI_File_iwrite(MPI_File fh, void *buf, int count, 
                    MPI_Datatype datatype, MPI_Request *request) 
MPI_FILE_IWRITE(FH, BUF, COUNT, DATATYPE, REQUEST, IERROR)
<type> BUF(*) 
INTEGER FH, COUNT, DATATYPE, REQUEST, IERROR
and
int MPI_File_iread(MPI_File fh, void *buf, int count, 
                   MPI_Datatype datatype, MPI_Request *request) 
MPI_FILE_IREAD(FH, BUF, COUNT, DATATYPE, REQUEST, IERROR)
<type> BUF(*) 
INTEGER FH, COUNT, DATATYPE, REQUEST, IERROR
As you see, the list of parameters is the same, with the exception that status is replaced with request. You have to keep inspecting the request, e.g., with MPI_Test, to check whether the operation has completed, or block on it with MPI_Wait; either function then returns the status.

These non-blocking writes and reads are very useful. Any external I/O operations are excruciatingly slow compared with memory access or with operations that are done on the registers. Consequently if you can organise your program so that you issue a non-blocking I/O request in advance, then go back to your computations and keep checking every now and then if the I/O operation completed, you'll be able to mask the slowness of I/O with computations. Programs like that can be very fast. But they are also extremely difficult to write and to debug.
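
Here is a sketch of this technique, assuming a hypothetical do_some_work function that advances the computation in small steps (fh, check, and particle_type are as in the sketches above):

MPI_Request request;
MPI_Status  status;
int done = 0;

/* start the read and return immediately */
MPI_File_iread(fh, check, 2, particle_type, &request);
while (!done) {
    do_some_work();                     /* keep computing ...       */
    MPI_Test(&request, &done, &status); /* ... and poll the request */
}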

The functions discussed so far perform sequential writes within their respective views. What if you want to write data at various locations within your view, jumping here and there out of order?

For this you would use a family of functions with the extension _AT. These functions are like the functions already discussed, but they take one more parameter, namely the offset from the beginning of the view.

File offsets in MPI/IO are always given in terms of etypes and are always measured from the beginning of the view. This is purely a matter of naming conventions, and agreeing on it saves unnecessary confusion. File displacements, on the other hand, are given in bytes and are measured from the beginning of the file.

The synopsis for the _AT functions is as follows:

int MPI_File_write_at(MPI_File fh, MPI_Offset offset, void *buf, 
                      int count, MPI_Datatype datatype, MPI_Status *status) 

MPI_FILE_WRITE_AT(FH, OFFSET, BUF, COUNT, DATATYPE, STATUS, IERROR)
<type> BUF(*) 
INTEGER FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR 
INTEGER(KIND=MPI_OFFSET_KIND) OFFSET       

int MPI_File_read_at(MPI_File fh, MPI_Offset offset, void *buf, 
                     int count, MPI_Datatype datatype, MPI_Status *status) 

MPI_FILE_READ_AT(FH, OFFSET, BUF, COUNT, DATATYPE, STATUS, IERROR)
<type> BUF(*) 
INTEGER FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR 
INTEGER(KIND=MPI_OFFSET_KIND) OFFSET
and similarly for the nonblocking and collective versions.
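
For example, to write two particles at the position of the fifth particle of the calling process's view (remember that the offset is counted in etypes from the beginning of the view):

/* explicit offset: 4 etypes into this process's view */
MPI_File_write_at(fh, (MPI_Offset)4, particles, 2,
                  particle_type, &status);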

There is one more group of MPI reads and writes. In all the functions discussed above, every process maintains its own file pointer: in the _AT functions the offset is supplied explicitly, whereas in MPI_FILE_READ the pointer is advanced implicitly. Either way, every process ends up reading different data.

What if we want all processes to share a single file pointer, so that their accesses follow one another in the file?

In this case we need to use data access functions with shared file pointers. These functions require that all participating processes have the same view of the file; each call advances the one shared pointer, so the processes' accesses are serialized with respect to each other. The functions are:

int MPI_File_write_shared(MPI_File fh, void *buf, int count, 
                          MPI_Datatype datatype, MPI_Status *status) 

MPI_FILE_WRITE_SHARED(FH, BUF, COUNT, DATATYPE, STATUS, IERROR) 
<type> BUF(*) 
INTEGER FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR       

int MPI_File_read_shared(MPI_File fh, void *buf, int count, 
                         MPI_Datatype datatype, MPI_Status *status) 
MPI_FILE_READ_SHARED(FH, BUF, COUNT, DATATYPE, STATUS, IERROR) 
<type> BUF(*) 
INTEGER FH, COUNT, DATATYPE, STATUS(MPI_STATUS_SIZE), IERROR
and they also have their non-blocking counterparts, MPI_File_iread_shared and MPI_File_iwrite_shared, as well as collective counterparts, MPI_File_read_ordered and MPI_File_write_ordered, which make the processes access the file in rank order.
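
Here is a sketch of a shared-pointer write: whichever process calls first writes first, and the shared pointer then advances past its data:

/* all processes append at the one shared pointer, in arrival order */
MPI_File_write_shared(fh, particles, 1, particle_type, &status);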


Zdzislaw Meglicki
2001-02-26