
Create a Compressed Dataset

By default, data written to HDF5 datasets is neither compressed nor protected by error checking. Both features can be turned on by means of dataset creation property lists. In this section I am going to show you how to activate compression.

The following program, taken from the NCSA HDF5 Tutorial, works as follows. First it creates a standard HDF5 data file called zip.h5. A group /Data is then created in the file. Next we generate a property list for dataset creation. The list activates two features: chunking and compression. Both are needed, because compression can be used only when the dataset is chunked. Then we create the dataset /Data/Compressed_Data using the list. The data itself is generated, then written to the dataset. At this stage we close the dataspace, the dataset, the group and the file.

Then we re-open the file, the group and the dataset. The data is read in full. There is no need for any hocus pocus with property lists here, because the required property is already attached to the dataset in the file and HDF5 learns about it when it opens the dataset. Decompression is activated automatically when the data is read. Having read the data, we print a small portion of it on standard output, then close the dataset, the group and the file.

Here's the program:

/* Create compressed dataset */ 

#include <stdio.h>
#include "hdf5.h"

#define FILE    "zip.h5"

/* Uncomment to remove compression and 
   comment out line above
#define FILE    "unzip.h5"
*/

#define RANK    2
 
int
main(void)
{

    hid_t    file, grp;
    hid_t    dataset, dataspace;
    hid_t    plist; 

    herr_t   status;
    hsize_t  dims[2];
    hsize_t  cdims[2];
 
    int      idx;
    int      i,j;
    int      buf[1000][20];
    int      rbuf [1000][20];

    /*
     * Create a file.
     */
    file = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    printf ("H5Fcreate returns: %d\n", file);

    /*
     * Create a group in the file. 
     */
    grp = H5Gcreate(file, "/Data", 0);
    printf ("H5Gcreate returns: %d\n", grp);

    /*
     * Create dataset "Compressed_Data" in the group using its absolute
     * name. The dataset creation property list is modified to use
     * GZIP compression with the compression effort set to 6.
     * Note that compression can be used only when the dataset is chunked.
     */
    dims[0] = 1000;
    dims[1] = 20;
    cdims[0] = 20;
    cdims[1] = 20;
    dataspace = H5Screate_simple(RANK, dims, NULL);
    printf ("H5Screate_simple: %d\n", dataspace);

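/*  The next section enables chunking and GZIP compression.
    To write an uncompressed dataset instead, comment it out
    together with the H5Dcreate call that follows it, and
    uncomment the alternative H5Dcreate call further down.
*/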
    plist  = H5Pcreate(H5P_DATASET_CREATE);
    printf ("H5Pcreate returns: %d\n", plist);
    status = H5Pset_chunk(plist, 2, cdims);
    printf ("H5Pset_chunk returns: %d\n", status);
    status = H5Pset_deflate( plist, 6); 
    printf ("H5Pset_deflate returns: %d\n", status);

    dataset = H5Dcreate(file, "/Data/Compressed_Data", H5T_STD_I32BE, 
                        dataspace, plist); 

/*
    dataset = H5Dcreate(file, "/Data/Uncompressed_Data", H5T_STD_I32BE, 
                        dataspace, H5P_DEFAULT); 
*/

    printf ("H5Dcreate returns: %d\n", dataset);


 
    for (i = 0; i< dims[0]; i++) {
        for (j=0; j<dims[1]; j++) {
           buf[i][j] = i+j;
        }
    }
    status = H5Dwrite(dataset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, H5P_DEFAULT, buf);
    printf ("H5Dwrite: %d\n", status);

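    /* Release the property list; it is no longer needed once the
       dataset has been created. */
    status = H5Pclose(plist);

    status = H5Sclose(dataspace);
    printf ("H5Sclose: %d\n", status);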

    status = H5Dclose(dataset);
    printf ("H5Dclose: %d\n", status);

    status = H5Gclose (grp);
    printf ("H5Gclose: %d\n", status);

    status = H5Fclose(file);
    printf ("H5Fclose: %d\n", status);

    /*
     * Now reopen the file and group in the file. 
     */
    file = H5Fopen(FILE, H5F_ACC_RDWR, H5P_DEFAULT);
    printf ("H5Fopen: %d\n", file);
    grp  = H5Gopen(file, "Data");
    printf ("H5Gopen: %d\n", grp);

    dataset = H5Dopen(grp, "Compressed_Data");

/* Uncomment, if removing compression 
   and comment out line above
    dataset = H5Dopen(grp, "Uncompressed_Data");
*/
    printf ("H5Dopen: %d\n", dataset);

    status = H5Dread (dataset, H5T_NATIVE_INT, H5S_ALL, H5S_ALL, 
                      H5P_DEFAULT, rbuf); 
 
    printf ("\nData (10 lines):\n");

    for (i=0; i<10; i++)
    {
      for (j=0; j<20; j++)
         printf(" %d", rbuf[i][j]);
      printf ("\n");
    }

    status = H5Dclose(dataset);
    printf ("\nH5Dclose: %d\n", status);

    status = H5Gclose (grp);
    printf ("H5Gclose: %d\n", status);

    status = H5Fclose(file);
    printf ("H5Fclose: %d\n", status);

    return 0;
}
The program is compiled and linked with h5cc and run normally by invoking its name:
gustav@bh1 $ h5cc -o h5_zip h5_zip.c
gustav@bh1 $ ./h5_zip
H5Fcreate returns: 67108864
H5Gcreate returns: 201326592
H5Screate_simple: 335544322
H5Pcreate returns: 805306377
H5Pset_chunk returns: 0
H5Pset_deflate returns: 0
H5Dcreate returns: 402653184
H5Dwrite: 0
H5Sclose: 0
H5Dclose: 0
H5Gclose: 0
H5Fclose: 0
H5Fopen: 67108865
H5Gopen: 201326593
H5Dopen: 402653185

Data (10 lines):
 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21
 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22
 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28

H5Dclose: 0
H5Gclose: 0
H5Fclose: 0
gustav@bh1 $
As the program runs, it prints the return values of its HDF5 function calls on standard output. The small portion of data printed at the end shows that the data read back from the compressed dataset is indeed correct. The data array itself is $1000\times20$ and its entries are $a_{ij} = i + j$, i.e., $0, 1, 2, \ldots$ in the first row, then $1, 2, 3, \ldots$ in the second row, $2, 3, 4, \ldots$ in the third row and so on.

But has the data been compressed? There are $1000\times20 = 20{,}000$ integers in the dataset, each 4 bytes long, which translates into 80,000 bytes. But the file is only 11,312 bytes long:

gustav@bh1 $ ls -l zip.h5
-rw-r--r--    1 gustav   ucs         11312 Nov 24 12:56 zip.h5
gustav@bh1 $
so the data in it must indeed have been compressed. You can run h5dump on this file and you'll get all the data back uncompressed, but its default output gives no hint that the data in the file is compressed. To see that it is, look at the file with h5ls:
gustav@bh1 $ h5ls -r -v zip.h5
Opened "zip.h5" with sec2 driver.
/Data                    Group
    Location:  0:1:0:1576
    Links:     1
/Data/Compressed_Data    Dataset {1000/1000, 20/20}
    Location:  0:1:0:1952
    Links:     1
    Modified:  2003-11-24 12:56:24 EST
    Chunks:    {20, 20} 1600 bytes
    Storage:   80000 logical bytes, 5316 allocated bytes, 1504.89% utilization
    Filter-0:  deflate-1 OPT {6}
    Type:      32-bit big-endian integer
gustav@bh1 $
Here you can see that the 80,000 logical bytes have been squeezed into 5,316 allocated bytes and that a deflate-1 OPT {6} filter, as we requested with the call to H5Pset_deflate:
    status = H5Pset_deflate( plist, 6);
has been used.

There are no new elements in this program other than the call to H5Pset_deflate, so I won't discuss the program in detail. By now it should be easy for you to see how the program goes about its business. Function H5Pset_deflate takes a dataset creation property list as its first argument and activates the deflate algorithm, the same one that GNU gzip uses, on the data. If you look at the gzip man page, you'll see that you can trade compression speed against compression ratio by calling gzip with a flag such as -1 or -9. The flag -1, which is equivalent to --fast, results in very fast but not very effective compression. On the other hand -9, which is equivalent to --best, results in slow but very effective compression. You can do the same when you call H5Pset_deflate. The second argument is the compression level, with the same meaning as the gzip flag. It can be any integer between 0 and 9, with 0 meaning no compression at all.


Zdzislaw Meglicki
2004-04-29