Structured versus Flat Files

An HDF5 file can be thought of as a directory tree within a file, the leaves of which are datasets or images.

This is a little similar to AFS filesets, which are also implemented as UFS files. Each AFS fileset is a whole lightweight file system, with directories and individual AFS files, but physically it is implemented as a single UFS file. Whereas AFS couples to the kernel, though, so that you can go to an AFS fileset and view its directory with the standard UNIX ls command (or Windows dir), HDF5 is implemented at the application level. Consequently, you have to use HDF5 utilities or HDF5 library calls in order to view what's inside HDF5 files.

HDF5 directories are called groups. As is the case with directories, HDF5 groups can contain other groups as well as datasets (in the UNIX world we would think of a dataset as a non-directory file).

If you have ever worked closely with the old Macintosh operating system (pre-Mac OS X), you may remember that Macintosh files comprised two forks: to each file there was a data fork and a resource fork that annotated it. Similarly, each HDF5 dataset may contain data as well as attributes. Attributes are annotations. They can be used to provide additional information about the data, e.g., units, or when and where the data was collected.

Here is a conceptual picture of a small HDF5 file:

HDF5 "dset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
      DATA {
         1, 2, 3, 4, 5, 6,
         7, 8, 9, 10, 11, 12,
         13, 14, 15, 16, 17, 18,
         19, 20, 21, 22, 23, 24
      }
      ATTRIBUTE "Units" {
         DATATYPE  H5T_STD_I32BE
         DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
         DATA {
            100, 200
         }
      }
   }
}
}
The file is called dset.h5. The top-level group in each HDF5 file is called "/". This is much the same as the top-level directory in UNIX. In this case the group has no sub-groups, i.e., there are no subdirectories here. But we have one dataset, whose full pathname is "/dset", and which comprises four items (or ``forks''). The first item specifies the type of the data. HDF5 introduces its own typing, much like MPI, so that you can write data in a portable format. If you were to write an HDF5 file on a PC and then take it to, e.g., a Cray X1, you should be able to read it with HDF5 utilities or programs without any loss of data. The type of the data written in this dataset is H5T_STD_I32BE, which stands for a 32-bit big-endian integer. The second item specifies the size of the dataspace. It says that the data is written as a simple 4 × 6 matrix, with one data item per slot. Then we have the data itself, which is the third item, and finally the attribute. The attribute is basically a small dataset in its own right, structured the same way as the dataset it describes: it has a datatype field, a dataspace field, and then the data itself. In this case the data comprises two integers, but more often you would store a string there that describes the data the attribute is attached to.

Now, you should not think that the data is written in the HDF5 file exactly as shown in the conceptual listing above. If you tried to just view the content of the HDF5 file with cat or type, all you'd see would be binary junk. You would not get much further with od either.
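
If you are curious how such a file comes into being, here is a minimal sketch of a C program that produces a file like dset.h5. It is written against the HDF5 1.6-era C API that was current when this text was written (the later 1.8 API adds extra property-list arguments to H5Dcreate and H5Acreate), and error checking is omitted for brevity:

#include "hdf5.h"

int main(void)
{
   hid_t file, dataspace, dataset, attr_space, attr;
   hsize_t dims[2] = {4, 6};
   hsize_t adims[1] = {2};
   int data[4][6], units[2] = {100, 200};
   int i, j;

   /* Fill the 4 x 6 matrix with 1..24, row by row. */
   for (i = 0; i < 4; i++)
      for (j = 0; j < 6; j++)
         data[i][j] = i * 6 + j + 1;

   /* Create the file; the root group "/" comes into existence
      automatically. */
   file = H5Fcreate("dset.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

   /* A simple 4 x 6 dataspace and a big-endian 32-bit integer dataset. */
   dataspace = H5Screate_simple(2, dims, NULL);
   dataset = H5Dcreate(file, "/dset", H5T_STD_I32BE, dataspace, H5P_DEFAULT);
   H5Dwrite(dataset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

   /* Attach the two-integer "Units" attribute to the dataset. */
   attr_space = H5Screate_simple(1, adims, NULL);
   attr = H5Acreate(dataset, "Units", H5T_STD_I32BE, attr_space, H5P_DEFAULT);
   H5Awrite(attr, H5T_NATIVE_INT, units);

   H5Aclose(attr); H5Sclose(attr_space);
   H5Dclose(dataset); H5Sclose(dataspace);
   H5Fclose(file);
   return 0;
}

Observe that the data lives in memory as native integers (H5T_NATIVE_INT), while the file stores it as H5T_STD_I32BE; HDF5 performs the conversion transparently, which is precisely where its portability comes from.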

Here is another conceptual example of an HDF5 file:

HDF5 "groups.h5" {
GROUP "/" {
   GROUP "MyGroup" {
      GROUP "Group_A" {
         DATASET "dset2" {
            DATATYPE  H5T_STD_I32BE
            DATASPACE  SIMPLE { ( 2, 10 ) / ( 2, 10 ) }
            DATA {
               1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
               1, 2, 3, 4, 5, 6, 7, 8, 9, 10
            }
         }
      }
      GROUP "Group_B" {
      }
      DATASET "dset1" {
         DATATYPE  H5T_STD_I32BE
         DATASPACE  SIMPLE { ( 3, 3 ) / ( 3, 3 ) }
         DATA {
            1, 2, 3,
            1, 2, 3,
            1, 2, 3
         }
      }
   }
}
}
This file comprises the following groups (directories): /, /MyGroup, /MyGroup/Group_A, and /MyGroup/Group_B. Because /MyGroup contains /MyGroup/Group_A, which, in turn, contains /MyGroup/Group_A/dset2, you could also say that /MyGroup contains dset2 too.
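
To give you an idea of how such a hierarchy is put together, here is a minimal sketch, again against the 1.6-era C API, that creates the group skeleton of groups.h5. Groups are made with H5Gcreate, much as you would mkdir directories:

#include "hdf5.h"

int main(void)
{
   hid_t file, my_group, group_a, group_b;

   file = H5Fcreate("groups.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);

   /* Create /MyGroup, then its two sub-groups. The third argument
      is a size hint for the group's name heap; 0 means "default". */
   my_group = H5Gcreate(file, "/MyGroup", 0);
   group_a  = H5Gcreate(file, "/MyGroup/Group_A", 0);
   group_b  = H5Gcreate(file, "/MyGroup/Group_B", 0);

   /* The datasets dset1 and dset2 would be created inside these
      groups with H5Dcreate, exactly as in the previous sketch. */

   H5Gclose(group_b);
   H5Gclose(group_a);
   H5Gclose(my_group);
   H5Fclose(file);
   return 0;
}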

Now let me show you what you are going to see if you view this file with some of the HDF5 tools. The name of the file is groups.h5 and the name of the tool I am going to use is h5ls. It can be used, like UNIX ls, to view the directory of the HDF5 file, but also, unlike ls, to view the contents of the datasets.

There is no man page provided with h5ls, but if you invoke it with any of the -h, -? or --help options, you'll get a brief synopsis:

gustav@bh1 $ h5ls --help
usage: h5ls [OPTIONS] [OBJECTS...]
   OPTIONS
      -h, -?, --help   Print a usage message and exit
      -a, --address    Print addresses for raw data
      -d, --data       Print the values of datasets
      -e, --errors     Show all HDF5 error reporting
      -f, --full       Print full path names instead of base names
      -g, --group      Show information about a group, not its contents
      -l, --label      Label members of compound datasets
      -r, --recursive  List all groups recursively, avoiding cycles
      -s, --string     Print 1-byte integer datasets as ASCII
      -S, --simple     Use a machine-readable output format
      -wN, --width=N   Set the number of columns of output
      -v, --verbose    Generate more verbose output
      -V, --version    Print version number and exit
      -x, --hexdump    Show raw data in hexadecimal format

   OBJECTS
      Each object consists of an HDF5 file name optionally followed by a
      slash and an object name within the file (if no object is specified
      within the file then the contents of the root group are displayed).
      The file name may include a printf(3C) integer format such as
      "%05d" to open a file family.
gustav@bh1 $
So let us just try h5ls groups.h5 first:
gustav@bh1 $ h5ls groups.h5
MyGroup                  Group
gustav@bh1 $
Well, the program says that there is one group there, at the top, called MyGroup. We can see all the groups, though, with a recursive listing, much the same as you would invoke ls -R in order to list the contents of a whole directory tree:
gustav@bh1 $ h5ls -r groups.h5
/MyGroup                 Group
/MyGroup/Group_A         Group
/MyGroup/Group_A/dset2   Dataset {2, 10}
/MyGroup/Group_B         Group
/MyGroup/dset1           Dataset {3, 3}
gustav@bh1 $
You can view the content of any selected dataset within the HDF5 file by using the -d switch and passing the full name of the dataset to h5ls as follows:
gustav@bh1 $ h5ls -d groups.h5/MyGroup/Group_A/dset2
MyGroup/Group_A/dset2    Dataset {2, 10}
    Data:
        (0,0) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
gustav@bh1 $
The data is not listed as a matrix, but you are informed that it is one by the dataspace descriptor Dataset {2, 10}.

The -v switch, which stands for --verbose, gives us much more information:

gustav@bh1 $ h5ls -v -r groups.h5
Opened "groups.h5" with sec2 driver.
/MyGroup                 Group
    Location:  0:1:0:1576
    Links:     1
/MyGroup/Group_A         Group
    Location:  0:1:0:2552
    Links:     1
/MyGroup/Group_A/dset2   Dataset {2/2, 10/10}
    Location:  0:1:0:5896
    Links:     1
    Modified:  2003-11-10 14:21:40 EST
    Storage:   80 logical bytes, 80 allocated bytes, 100.00% utilization
    Type:      32-bit big-endian integer
/MyGroup/Group_B         Group
    Location:  0:1:0:3528
    Links:     1
/MyGroup/dset1           Dataset {3/3, 3/3}
    Location:  0:1:0:5624
    Links:     1
    Modified:  2003-11-10 14:21:40 EST
    Storage:   36 logical bytes, 36 allocated bytes, 100.00% utilization
    Type:      32-bit big-endian integer
gustav@bh1 $
Here we find not only the location of each item within the HDF5 file, but also when each of the items was modified and how each dataset utilizes the space that has been allocated to it. In HDF5 you allocate space for a dataset separately, and the dataset does not have to use all of it. In this case, however, the datasets fill the space that's been allocated to them entirely.
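
The same storage information can be queried from within a program. The following sketch, still using the 1.6-era API, opens groups.h5, reads /MyGroup/Group_A/dset2 into native integers, and asks HDF5 how many bytes of storage the dataset occupies on disk:

#include <stdio.h>
#include "hdf5.h"

int main(void)
{
   hid_t file, dataset;
   int data[2][10];
   hsize_t nbytes;

   file = H5Fopen("groups.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
   dataset = H5Dopen(file, "/MyGroup/Group_A/dset2");

   /* Read the whole dataset, converting from the big-endian file
      representation to this machine's native integer type. */
   H5Dread(dataset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, data);

   /* The allocated storage reported here corresponds to the
      "80 allocated bytes" line in the h5ls -v listing above. */
   nbytes = H5Dget_storage_size(dataset);
   printf("dset2 occupies %llu bytes, first element %d\n",
          (unsigned long long) nbytes, data[0][0]);

   H5Dclose(dataset);
   H5Fclose(file);
   return 0;
}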

The program h5dump can be used to view the whole file, displayed in the same way as I have done at the beginning of this section:

gustav@bh1 $ h5dump dset.h5
HDF5 "dset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
      DATA {
         1, 2, 3, 4, 5, 6,
         7, 8, 9, 10, 11, 12,
         13, 14, 15, 16, 17, 18,
         19, 20, 21, 22, 23, 24
      }
      ATTRIBUTE "Units" {
         DATATYPE  H5T_STD_I32BE
         DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
         DATA {
            100, 200
         }
      }
   }
}
}
gustav@bh1 $
This kind of notation, which is quite similar to the way you structure C and C++ programs, is called the Data Description Language, or DDL for short. The language can be formalized, using, e.g., the Backus-Naur Form, but I'll stay away from that, because DDL is intuitive enough to be easily understood without formalization. h5dump can dump your HDF5 data in other formats too. For example, if you use the -x switch, the content of the file will be dumped in XML. The XML description, though, is horribly verbose and far from being as intuitive and clear as DDL.

Instead of dumping the whole HDF5 file, you can dump only a selected object. For example:

gustav@bh1 $ h5dump -a /dset/Units dset.h5
HDF5 "dset.h5" {
ATTRIBUTE "/dset/Units" {
   DATATYPE  H5T_STD_I32BE
   DATASPACE  SIMPLE { ( 2 ) / ( 2 ) }
   DATA {
      100, 200
   }
}
}
gustav@bh1 $
This command lists the selected attribute, in this case /dset/Units, associated with the dataset /dset. A single dataset can have several attributes with different names associated with it.
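
Attributes are just as easy to get at from a program as datasets. Here is a minimal sketch, once more against the 1.6-era API (H5Aopen_name was the call of that vintage; later API versions prefer H5Aopen), that reads the Units attribute of /dset back from dset.h5:

#include <stdio.h>
#include "hdf5.h"

int main(void)
{
   hid_t file, dataset, attr;
   int units[2];

   file = H5Fopen("dset.h5", H5F_ACC_RDONLY, H5P_DEFAULT);
   dataset = H5Dopen(file, "/dset");

   /* Open the attribute by name and read its two integers. */
   attr = H5Aopen_name(dataset, "Units");
   H5Aread(attr, H5T_NATIVE_INT, units);
   printf("Units = %d, %d\n", units[0], units[1]);

   H5Aclose(attr);
   H5Dclose(dataset);
   H5Fclose(file);
   return 0;
}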

Calling h5dump with the --help option invokes a brief description of the utility, with all options listed and explained.

Now, once you know what structured files are, let us go back to the basic question of this section: ``Structured versus Flat Files''. Why should we bother with structured files if we can always structure our data using the directory tree itself? After all, instead of writing various datasets on various parts of a structured HDF5 file, we could accomplish much the same by writing them to various files, possibly located in various directories. We could write separate files containing attributes, and so on. In other words, we could simply take the whole structure of an HDF5 file out and lay it on top of a file system. Doesn't an HDF5 file imitate a file system internally? This, after all, is what most scientists do anyway.

The answer is portability, seen in two ways. First, portability from researcher to researcher. If you have your data organized in a directory tree, then in order to exchange it with other researchers you have to send them the whole tree, presumably collated into a single file, e.g., a tar archive. But tar is an operating-system-dependent utility. tar implies UNIX, and when the files get unpacked on, e.g., a Windows system or a VMS system or an MVS system or some other equally exotic system, they may not come out right. The data, if written in a little-endian fashion, may get all scrambled if a tar file is unpacked on a big-endian system. The annotations may get lost. File names may get corrupted. Sometimes even the directory structure may get altered or lost too. Some systems may not allow as deep a directory nesting as others. So this brings us right to the second issue: portability between operating systems and machine architectures.

HDF5 structured files, by bringing the structure into the file itself, release us from the dependence on the file system and on the operating system. By providing us with machine independent formats for data they release us from the dependence on the machine's architecture and hardware.

By letting us put all the data, annotations, and structuring into a single file, HDF5 helps us manage the data too. If you have a lot of data scattered over a huge directory tree, it's quite easy to get things wrong and either corrupt the data by renaming files incorrectly or placing them in the wrong directories, or even lose it by overwriting a file accidentally.

Last, but not least, traversing a directory tree and dealing with a large number of files from within your application can be expensive.

If you can replace it all with a single file that has all the required structuring, annotations, and data inside, your life as a programmer and maintainer may get quite a lot easier. At the end of the day, you may think of HDF5 files as small databases. They sit, in a sense, right in between: where a single flat file is too primitive for handling your problem, but where a fully blown database would be overkill.

