The AVIDD GPFS

GPFS is a truly parallel file system. In the case of the AVIDD cluster, the Bloomington component of GPFS is served by bf1, bf2, bf3, and bf4, and the Indianapolis component is served by if1, if2, if3, and if4.

What does truly parallel mean in this context? It means that every file written on GPFS ends up being striped, in our case, over four different machines. The striping is much like the striping of data on a disk array, but here the difference is that the GPFS ``disk array'' is assembled from disk arrays attached to several machines. This has the advantage of overcoming the IO limitations of a single server.

Our experience with PC blades tells us that an IA32 server with a disk array attached to it can write at about 15 MB/s to the array and read at up to 40 MB/s from it. So if you have a very large file that is striped over four such servers, you should expect four times that performance, i.e., 60 MB/s on the writes and 160 MB/s on the reads. And this is indeed, more or less, what we see on the AVIDD cluster (in practice somewhat more rather than less, but we are within the right order of magnitude).

60 MB/s on the writes does not amount to much. This is not high performance computing. For high performance computing we would need at least ten times more, which means 40 GPFS servers at IUB and another 40 at IUPUI. But for the time being we don't have projects in place that need this level of performance, and so we have only four servers at each site. Still, this is enough for us to learn about this technology, and if you ever need much higher levels of performance, you can always get an account at PSC, NCSA, or SDSC.

SDSC is especially well equipped for data-intensive work and has demonstrated transfers from its tape silos at 828 MB/s. NCSA engineers tested a system with 40 GPFS servers on their IA32 cluster, getting the expected performance, but they have since reduced the number of GPFS server nodes to four, and in the future they plan to use a different solution, based on disks attached to a Storage Area Network (SAN) and PVFS, which is a freeware parallel file system from Clemson University in South Carolina.

Note that in order to get very high IO transfer rates to parallel processes running on clusters you must have a parallel file system. Whether this file system is served from disk arrays on a SAN or from disk arrays directly attached to some server nodes is another issue.

But let's get back to AVIDD. The directory /N/gpfsb is writable by all. So the first thing you need to do is create your own subdirectory there:

[gustav@bh1 gustav]$ cd /N/gpfsb
[gustav@bh1 gpfsb]$ mkdir gustav
[gustav@bh1 gpfsb]$
From this point on you can use this directory like a normal UNIX file system. This, in particular, is where you should run your parallel jobs from.
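
If you would rather keep other users out of your files, you can tighten the permissions on the new directory with the usual UNIX tools, for example (substituting your own user name for gustav):

[gustav@bh1 gpfsb]$ chmod 700 gustav

With mode 700 only you can list, read, and write the directory; mode 755 would let others read your files without being able to modify them.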

GPFS files are protected by the usual UNIX permissions, i.e., read (r), write (w), and execute (x) for the user, the group, and others. GPFS on the SP also supports additional controls in the form of Access Control Lists (ACLs). These can be manipulated with the "mm" commands, such as mmeditacl, mmgetacl, and mmputacl, which are described in /usr/man/man1, but these commands don't seem to work on the AVIDD GPFS. The IA32 version of GPFS is quite restricted compared to the fully functional GPFS you get on the SP.
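
For completeness, this is roughly what working with a GPFS ACL looks like on a system where the mm commands do work, for example on the SP. Treat it as a hypothetical sketch only: the file names are made up, the options may differ between GPFS releases (check the man pages in /usr/man/man1), and, as noted above, these commands do not appear to work on the AVIDD GPFS, where you have to make do with chmod and chgrp.

# dump the current ACL of myfile into a text file (the -o option is assumed here)
mmgetacl -o myfile.acl myfile
# edit myfile.acl with your favourite editor, then install the modified ACL
mmputacl -i myfile.acl myfile
# or do both steps in one go in your editor
mmeditacl myfile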

Although AVIDD GPFS delivers only 60/160 MB/s on writes/reads, there are situations when you may notice much higher transfer rates, especially on writes. How can this be? The answer is memory caching. UNIX never writes data to the media directly. When you open a file, write to it, and close it, even when you flush it (UNIX programmers should know what this means), the data does not go to the disks right away. Even when UNIX thinks that it has pushed the data to the disk, the data may still be stuck in the disk's own memory cache. These multiple levels of memory caches, in UNIX, in GPFS, and on the physical disk arrays themselves, serve to mask the slowness with which data is written to the actual physical media compared to the speed with which it can be handled internally within the computer.

Because GPFS server nodes and computational nodes all have a lot of memory (I think they have at least 2 GB each), they can cache a lot of IO, perhaps even hundreds of MBs. So if you write just 100 MB of data to GPFS, you may discover that the write has occurred at memory speeds, not disk speeds. The only time you actually get to see the real IO transfer rate is when you write a very, very large file. Then all the memory caches overflow and the data eventually has to be written to the physical media. As you keep pushing the data from your program you will see the transfer rate drop and drop until it eventually reaches the real disk write speed.
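
You can observe the caching for yourself with a simple, hypothetical test like the one below: time a moderately sized write with dd and then time how long sync takes to flush the dirty data out of memory. The file name and sizes here are made up; substitute your own GPFS directory.

# write 100 MB to GPFS and time it: this will likely run at memory speed
time dd if=/dev/zero of=/N/gpfsb/gustav/cache_test bs=1M count=100
# now time how long it takes to flush the cached data towards the media
time sync
# clean up
rm /N/gpfsb/gustav/cache_test

Even after sync returns, some of the data may still be sitting in the disk array's own cache, as discussed above.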

Another trick that hardware vendors employ is automatic data compression on disk arrays. This is often done by a special chip embedded in the array controller. The effect of this is that if you write a file of, say, zeros to the drive, the transfer rate is going to be phenomenal, even with all the memory caching out of the way. So, if you want to test real IO in this context, you need to write strings of random numbers or characters to the drive. Such strings cannot be compressed.
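
If you want to take compression out of such a test, one hypothetical way to do it is to prepare an incompressible file from /dev/urandom once and then write that to GPFS, rather than reading /dev/urandom on the fly, which is itself slow and would skew the measurement. The file names and sizes below are made up; remember from the previous paragraph that the total amount written has to be large enough to overflow the memory caches before you see the real transfer rate.

# prepare 100 MB of incompressible data in a local scratch file (slow, done once)
dd if=/dev/urandom of=/tmp/random.dat bs=1M count=100
# write compressible zeros: 2 GB, enough to push past the memory caches
time dd if=/dev/zero of=/N/gpfsb/gustav/zeros.dat bs=1M count=2000
# write the same amount of incompressible data by repeating the random block 20 times
time sh -c 'for i in `seq 1 20`; do cat /tmp/random.dat; done > /N/gpfsb/gustav/random.dat'
# clean up
rm /tmp/random.dat /N/gpfsb/gustav/zeros.dat /N/gpfsb/gustav/random.dat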

All this memory caching and automatic data compression are good things, and there is a way to make use of them even when you work with very large files. The way is to process a file on a very large number of GPFS clients. If you write a file in parallel from, say, 40 GPFS clients, even if the file is 10 GB long, you'll end up writing only 250 MB per client. This amount of data is probably going to be cached in the client's memory while your program executes and closes the file. The data will take much longer to flow to the media, of course, but all this is going to take place in the background and you will not have to wait for it. As far as you are concerned, you will probably have written this file at several hundred MB/s.
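
Here is a hypothetical sketch of that slice-per-client idea, using nothing but dd: each client writes its own 250 MB slice into a shared GPFS file at an offset determined by its rank. In a real application you would more likely do this from within your program, for instance with MPI-IO, but the principle is the same. The script name, file name, and the way the rank is passed in are all made up; you would start one copy on each of your 40 clients from your batch job or over ssh, and for a real measurement you would write incompressible data rather than zeros, as explained above.

#!/bin/bash
# write_slice.sh: run one copy on each client with a different rank (0..39)
RANK=$1
FILE=/N/gpfsb/gustav/bigfile.dat
# each client writes a 250 MB slice at an offset of RANK * 250 MB;
# seek counts output blocks of size bs, and conv=notrunc keeps dd from
# truncating the parts of the file written by the other clients
dd if=/dev/zero of=$FILE bs=1M count=250 seek=$((RANK * 250)) conv=notrunc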

Physical writes are always slower than physical reads. There are physical reasons for this. When you read data from a disk, you first need to find where the data is. This is usually a pretty fast process, based on hash tables or something similar. Eventually you get to the data, the head lowers and reads the data from the disk surface, and the data then flows to the user. When you write data to the disk, the process is slower because, first, you have to locate where the best free space is, then you have to add new records to the hash table, which is slower than just looking up existing data, and finally you have to write the data to the media. The writing process itself is also more involved: stronger magnetic fields have to be applied, there is a lot of checking, and if there are errors, the writes are repeated and portions of the disk may have to be marked as bad, and so on.

Virtual writes are always faster than virtual reads. The reason for this is that virtual writes only write data to memory and so they can be very fast. But virtual reads almost always have to read data from the disk physically, unless the data has been read very recently and is still cached in the disk's controller.


Zdzislaw Meglicki
2004-04-29