An HDF5 file can be thought of as a directory tree within a file, the leaves of which are datasets or images.
This is a little similar to AFS filesets, which are also
implemented as UFS files. Each AFS fileset is a whole lightweight file system, with directories and
individual AFS files, but physically it is
implemented as a single UFS file. But whereas AFS couples to the kernel, so that you
can go to an AFS fileset and use the standard UNIX ls command (or Windows dir)
to view its directory, HDF5 is implemented at the application level.
Consequently, you have to use HDF5 utilities or HDF5 library calls in order to view
what's inside HDF5 files.
HDF5 directories are called groups. As is the case with directories, HDF5 groups can contain other groups as well as datasets (in the UNIX world we would think of a dataset as a non-directory file).
If you have ever worked closely with the old Macintosh operating system (pre-Mac OS X), you may remember that Macintosh files comprised two forks. Each file had a data fork and an annotation fork. Similarly, each HDF5 dataset may contain data as well as attributes. Attributes are annotations. They can be used to provide additional information about the data, e.g., units, or when and where the data was collected.
Here is a conceptual picture of a small HDF5 file:
HDF5 "dset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE H5T_STD_I32BE
      DATASPACE SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
      DATA {
         1, 2, 3, 4, 5, 6,
         7, 8, 9, 10, 11, 12,
         13, 14, 15, 16, 17, 18,
         19, 20, 21, 22, 23, 24
      }
      ATTRIBUTE "Units" {
         DATATYPE H5T_STD_I32BE
         DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
         DATA {
            100, 200
         }
      }
   }
}
}
The file is called dset.h5. The top level group in each HDF5 file is
called "/". This is much the same as the top level directory in UNIX.
In this case this group has no sub-groups, i.e., there are no subdirectories here.
But we have one dataset, whose full pathname is "/dset", which comprises
four items (or "forks"). The first item specifies the type of the data.
HDF5 introduces its own typing, much like MPI, so that you can write data
in a portable format. If you were to write an HDF5 file on a PC, and then take
it to, e.g., a Cray-X1, you should be able to read it with HDF5 utilities or
programs without any loss of data. The type of the data written on this dataset
is H5T_STD_I32BE, which stands for a 32-bit big-endian integer. The second
item specifies the size of the data space. It says that the data is written
as a simple 4 by 6 array.
Now, you should not think that the data is written on the HDF5 file exactly
as shown in the conceptual listing above. If you tried to just view the
content of the HDF5 file with cat or type, all you'd see
would be binary junk. You would not get much further with od either.
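To make the meaning of H5T_STD_I32BE a little more concrete, here is a minimal Python sketch that encodes the 24 values of the dataset as 32-bit big-endian integers using only the standard struct module. This is an illustration of the element encoding alone, under the assumption that the values are laid out row by row; it is not the HDF5 on-disk format, which also carries headers and other metadata, and the variable names are mine.

```python
import struct

# The 4 x 6 dataset from the listing, flattened row by row.
values = list(range(1, 25))

# ">" forces big-endian byte order and "i" is a 32-bit signed integer,
# which is what H5T_STD_I32BE describes for each element.
raw = struct.pack(">24i", *values)

print(len(raw))        # 24 elements * 4 bytes each
print(raw[:8].hex())   # the first two elements as raw bytes

# Decoding with the same explicit byte order recovers the values
# regardless of the byte order of the machine running this script.
decoded = list(struct.unpack(">24i", raw))
print(decoded == values)
```

The raw bytes are mostly zeros and small non-printing values, which is why cat shows nothing readable. Because the byte order is stated explicitly rather than inherited from the host, the same bytes decode to the same numbers everywhere.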
Here is another conceptual example of an HDF5 file:
HDF5 "groups.h5" {
GROUP "/" {
   GROUP "MyGroup" {
      GROUP "Group_A" {
         DATASET "dset2" {
            DATATYPE H5T_STD_I32BE
            DATASPACE SIMPLE { ( 2, 10 ) / ( 2, 10 ) }
            DATA {
               1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
               1, 2, 3, 4, 5, 6, 7, 8, 9, 10
            }
         }
      }
      GROUP "Group_B" {
      }
      DATASET "dset1" {
         DATATYPE H5T_STD_I32BE
         DATASPACE SIMPLE { ( 3, 3 ) / ( 3, 3 ) }
         DATA {
            1, 2, 3,
            1, 2, 3,
            1, 2, 3
         }
      }
   }
}
}
This file comprises the following groups (directories):
/
/MyGroup, which contains the dataset /MyGroup/dset1
/MyGroup/Group_A, which contains the dataset /MyGroup/Group_A/dset2
/MyGroup/Group_B, which is empty
Because /MyGroup contains /MyGroup/Group_A, which, in turn, contains
/MyGroup/Group_A/dset2, you could also say that /MyGroup
contains dset2 too.
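The containment relations just described can be sketched with a small in-memory model. This is only a hypothetical illustration in plain Python: a dict stands for a group, a tuple stands for the shape of a dataset, and the function name walk is my own. It does not use the HDF5 library or read a real file.

```python
# Hypothetical in-memory model of groups.h5: a dict is a group,
# a tuple is the shape of a dataset.
tree = {
    "MyGroup": {
        "Group_A": {"dset2": (2, 10)},
        "Group_B": {},
        "dset1": (3, 3),
    }
}


def walk(group, prefix=""):
    """Yield (full_path, kind) for every object below a group, depth first."""
    for name in sorted(group):
        member = group[name]
        path = f"{prefix}/{name}"
        if isinstance(member, dict):
            yield path, "Group"
            yield from walk(member, path)
        else:
            yield path, f"Dataset {member}"


listing = list(walk(tree))
for path, kind in listing:
    print(path, kind)
```

Every object ends up with a full pathname built from the names of the groups above it, exactly as a file gets its pathname from the directories above it.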
Now let me show you what you are going to see if you view this file
with some of the HDF5 tools. The name of the file is groups.h5 and
the name of the tool I am going to use is h5ls.
It can be used, like UNIX ls, to view the directory of the HDF5 file, but also,
unlike ls, to view the contents of the datasets.
There is no man entry provided with h5ls, but if you invoke it
with any of the -h, -? or --help options, you'll get a
brief synopsis:
gustav@bh1 $ h5ls --help
usage: h5ls [OPTIONS] [OBJECTS...]
OPTIONS
-h, -?, --help Print a usage message and exit
-a, --address Print addresses for raw data
-d, --data Print the values of datasets
-e, --errors Show all HDF5 error reporting
-f, --full Print full path names instead of base names
-g, --group Show information about a group, not its contents
-l, --label Label members of compound datasets
-r, --recursive List all groups recursively, avoiding cycles
-s, --string Print 1-byte integer datasets as ASCII
-S, --simple Use a machine-readable output format
-wN, --width=N Set the number of columns of output
-v, --verbose Generate more verbose output
-V, --version Print version number and exit
-x, --hexdump Show raw data in hexadecimal format
OBJECTS
Each object consists of an HDF5 file name optionally followed by a
slash and an object name within the file (if no object is specified
within the file then the contents of the root group are displayed).
The file name may include a printf(3C) integer format such as
"%05d" to open a file family.
gustav@bh1 $
So let us just try h5ls groups.h5 first:
gustav@bh1 $ h5ls groups.h5
MyGroup Group
gustav@bh1 $
Well, the program says that there is one group there, at the top, called
MyGroup.
We can see all groups, though, with the recursive listing, much the
same as you would invoke ls -R in order to view the contents of a whole directory tree:
gustav@bh1 $ h5ls -r groups.h5
/MyGroup Group
/MyGroup/Group_A Group
/MyGroup/Group_A/dset2 Dataset {2, 10}
/MyGroup/Group_B Group
/MyGroup/dset1 Dataset {3, 3}
gustav@bh1 $
You can view the content of any selected dataset within the HDF5 file by
using the -d switch and passing the full name of the dataset
to h5ls as follows:
gustav@bh1 $ h5ls -d groups.h5/MyGroup/Group_A/dset2
MyGroup/Group_A/dset2 Dataset {2, 10}
Data:
(0,0) 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
gustav@bh1 $
The data is not listed as a matrix, but you are informed about it being a matrix
by the formatting statement Dataset {2, 10}.
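If you want to see the matrix itself, the flat list can be folded back into rows using the shape given in the braces. The sketch below assumes row-major (C) ordering, in which the last index varies fastest; the variable names are mine.

```python
# Values as h5ls printed them: one flat, row-major sequence.
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10,
        1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rows, cols = 2, 10   # taken from "Dataset {2, 10}"

# In row-major order, element (i, j) sits at flat index i * cols + j.
matrix = [flat[i * cols:(i + 1) * cols] for i in range(rows)]
for row in matrix:
    print(row)
```

The (0,0) printed by h5ls in front of the data is the index of the first element on that output line, which is consistent with this layout.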
The -v switch, which stands for --verbose, gives us much more information:
gustav@bh1 $ h5ls -v -r groups.h5
Opened "groups.h5" with sec2 driver.
/MyGroup Group
Location: 0:1:0:1576
Links: 1
/MyGroup/Group_A Group
Location: 0:1:0:2552
Links: 1
/MyGroup/Group_A/dset2 Dataset {2/2, 10/10}
Location: 0:1:0:5896
Links: 1
Modified: 2003-11-10 14:21:40 EST
Storage: 80 logical bytes, 80 allocated bytes, 100.00% utilization
Type: 32-bit big-endian integer
/MyGroup/Group_B Group
Location: 0:1:0:3528
Links: 1
/MyGroup/dset1 Dataset {3/3, 3/3}
Location: 0:1:0:5624
Links: 1
Modified: 2003-11-10 14:21:40 EST
Storage: 36 logical bytes, 36 allocated bytes, 100.00% utilization
Type: 32-bit big-endian integer
gustav@bh1 $
Here we find not only the location of each item within the
HDF5 file, but also when each dataset was last modified and how
each dataset utilizes the space that has been allocated to it.
In HDF5 you allocate a space for a dataset separately, and the dataset
does not have to use all of it. But in this case the datasets
fill the space that's been allocated to them entirely.
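The storage figures in the verbose listing can be checked by hand: a 32-bit integer takes 4 bytes, so the logical size is the number of elements times 4. Here is a small sketch of that arithmetic; the helper names logical_bytes and utilization are mine, not part of any HDF5 tool.

```python
from math import prod

ELEMENT_BYTES = 4  # a 32-bit integer occupies 4 bytes


def logical_bytes(shape, element_bytes=ELEMENT_BYTES):
    """Number of bytes needed to hold every element of a dataset."""
    return prod(shape) * element_bytes


def utilization(logical, allocated):
    """Percentage of the allocated space that the data actually uses."""
    return 100.0 * logical / allocated


dset2 = logical_bytes((2, 10))
dset1 = logical_bytes((3, 3))
print(dset2, dset1)
print(f"{utilization(dset2, 80):.2f}%")
```

A dataset for which more space had been allocated than it uses would show a utilization below 100%.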
The h5dump program can be used to view the whole file, displayed in the same way as in the listings at the beginning of this section:
gustav@bh1 $ h5dump dset.h5
HDF5 "dset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE H5T_STD_I32BE
      DATASPACE SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
      DATA {
         1, 2, 3, 4, 5, 6,
         7, 8, 9, 10, 11, 12,
         13, 14, 15, 16, 17, 18,
         19, 20, 21, 22, 23, 24
      }
      ATTRIBUTE "Units" {
         DATATYPE H5T_STD_I32BE
         DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
         DATA {
            100, 200
         }
      }
   }
}
}
gustav@bh1 $
This kind of notation, which is quite similar to the way you
structure C and C++ programs, is called the Data Description Language, or DDL for short.
The language can be formalized, using, e.g., Backus-Naur Form, but I'll stay away
from it, because DDL is intuitive enough to be easily understandable without
formalizing. But h5dump can dump your HDF5 data in other formats
too. For example, if you use the -x switch, the content of the
file will be dumped in XML. The XML description, though, is horribly verbose and far from
being as intuitive and clear as DDL.
Instead of dumping the whole HDF5 file, you can dump only a selected object. For example:
gustav@bh1 $ h5dump -a /dset/Units dset.h5
HDF5 "dset.h5" {
ATTRIBUTE "/dset/Units" {
   DATATYPE H5T_STD_I32BE
   DATASPACE SIMPLE { ( 2 ) / ( 2 ) }
   DATA {
      100, 200
   }
}
}
gustav@bh1 $
This command lists the selected attribute, in this case /dset/Units,
associated with the dataset /dset. A single dataset can have several
attributes with different names associated with it.
Calling h5dump with the --help option prints a brief description
of the utility with all options listed and explained.
Now, once you know what structured files are, let us go back to the basic question of this section: "Structured versus Flat Files". Why should we bother about structured files if we can always structure our data using the directory tree itself? After all, instead of writing various datasets on various parts of a structured HDF5 file, we could accomplish much the same by writing them on various files, possibly located in various directories. We could write separate files containing attributes and so on. In other words, we could simply take the whole structure of an HDF5 file out and lay it on top of a file system. Doesn't an HDF5 file imitate a file system internally? This, after all, is what most scientists do anyway.
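To see what "laying the structure on top of a file system" amounts to, here is a hedged sketch that rebuilds the layout of groups.h5 as ordinary directories and text files in a temporary directory. The choice of one text file per dataset is my own assumption for illustration; nothing here reads or writes HDF5.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: each group becomes a directory, each dataset a text file.
datasets = {
    "MyGroup/Group_A/dset2": "1 2 3 4 5 6 7 8 9 10\n1 2 3 4 5 6 7 8 9 10\n",
    "MyGroup/dset1": "1 2 3\n1 2 3\n1 2 3\n",
}

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "MyGroup" / "Group_B").mkdir(parents=True)  # an empty group
    for relative, text in datasets.items():
        target = root / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text)

    # Walking the tree touches one file system object per group and dataset.
    found = sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))

print(found)
```

Even this tiny example produces five separate file system objects whose names, nesting and contents all depend on the host system, and the attributes would need still more files.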
The answer is portability, seen in
two ways. First, portability from researcher to researcher.
If you have your data organized in a directory tree, in order
to exchange it with other researchers, you have to send them
the whole tree, presumably collated into a single
file, e.g., a tar archive. But tar is an
operating system dependent utility. tar implies
UNIX, and when the files get unpacked
on, e.g., a Windows system or a VMS system or an MVS system or some
other equally exotic system, they may not come out
right. The data, if written in a little-endian
fashion, may get all scrambled if the tar file is unpacked
and read on a big-endian system. The annotations
may get lost. File names may get corrupted. Sometimes even the directory structure
may get altered or lost too. Some systems may not allow
as deep a directory nesting as other systems. So this
brings us right to the second issue: portability between
operating systems and machine architectures.
HDF5 structured files, by bringing the structure into the file itself, release us from the dependence on the file system and on the operating system. By providing us with machine independent formats for data they release us from the dependence on the machine's architecture and hardware.
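The byte-order hazard is easy to demonstrate. The sketch below uses Python's struct module to mimic a program that writes an integer in little-endian order and a reader that assumes big-endian order; it illustrates the problem that a self-describing format avoids and is not how HDF5 itself is implemented.

```python
import struct

value = 1

# A program dumping raw memory on a little-endian machine writes these bytes.
little_endian_bytes = struct.pack("<i", value)

# A reader that assumes big-endian order sees a different number.
misread = struct.unpack(">i", little_endian_bytes)[0]

# A reader that knows the byte order of the data decodes it correctly.
correct = struct.unpack("<i", little_endian_bytes)[0]

print(misread, correct)
```

Because an HDF5 dataset records its own datatype, such as H5T_STD_I32BE, the reader does not have to guess which of the two interpretations applies.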
By letting us put all the data, annotations, and structuring into a single file, HDF5 helps us manage the data too. If you have a lot of data scattered over a huge directory, it's quite easy to get things wrong and either corrupt the data by renaming files incorrectly or placing them in the wrong directories, or even lose it by overwriting a file accidentally.
Last, but not least, traversing a directory tree and dealing with a large number of files from within your application can be expensive.
If you can replace it all with a single file that has all the required structuring, annotations and data inside, your life as a programmer and maintainer may get quite a lot easier. At the end of the day, you may think of HDF5 files as small databases. They sit right in between, where a single flat file is too primitive for handling your problem and a full-blown database would be overkill.