How Chunking and Compression Can Help You - Python and HDF5


Figure 4-3. HDF5 data pipeline, showing a dataset with GZIP and SHUFFLE filters applied

Compression Filters

A number of compression filters are available in HDF5. By far the most commonly used

is the GZIP filter. (You'll also hear this referred to as the “DEFLATE” filter; in the HDF5

world both names are used for the same filter.)

Here's an example of GZIP compression used on a floating-point dataset:

>>> dset = f.create_dataset("BigDataset", (1000, 1000), dtype='f', compression="gzip")
>>> dset.compression
'gzip'

By the way, you're not limited to floats. The great thing about GZIP compression is that

it works with all fixed-width HDF5 types, not just numeric types.
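As a minimal sketch of that point, the snippet below applies GZIP to a fixed-width string dataset. The file name, dataset name, and stored value are hypothetical, and the file uses the in-memory "core" driver so nothing is written to disk.

```python
import h5py
import numpy as np

# In-memory file: the "core" driver with backing_store=False never touches disk.
# "strings_demo.h5" and "names" are hypothetical names chosen for illustration.
with h5py.File("strings_demo.h5", "w", driver="core", backing_store=False) as f:
    # 'S10' is a fixed-width 10-byte string type, not a numeric type.
    names = f.create_dataset("names", (1000,), dtype="S10", compression="gzip")

    # Fill the dataset with the same (hypothetical) value in every element.
    names[...] = np.full((1000,), b"sensor", dtype="S10")

    filter_name = names.compression   # name of the compression filter in use
    first_value = names[0]            # reading back decompresses transparently
```

Reading `names[0]` returns the stored bytes without any explicit decompression step, just as with the floating-point dataset.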

Compression is transparent; data is read and written normally:

>>> dset[...] = 42.0
>>> dset[0, 0]
42.0

Investigating the Dataset object, we find a few more properties:

>>> dset.compression_opts
4
>>> dset.chunks
(63, 125)

The compression_opts property (and corresponding keyword to create_dataset) reflects any settings for the compression filter. In this case, the default GZIP level is 4.
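To pick a level other than the default, the same keyword can be passed to create_dataset. The sketch below is illustrative only: the file and dataset names are hypothetical, and it assumes the usual GZIP level range of 0 to 9, where higher levels generally spend more CPU time for smaller output.

```python
import h5py

# In-memory file so the example leaves nothing on disk; all names are hypothetical.
with h5py.File("levels_demo.h5", "w", driver="core", backing_store=False) as f:
    # Level 1: fastest, weakest compression.
    fast = f.create_dataset("fast", (1000, 1000), dtype="f",
                            compression="gzip", compression_opts=1)
    # Level 9: slowest, strongest compression.
    small = f.create_dataset("small", (1000, 1000), dtype="f",
                             compression="gzip", compression_opts=9)

    # The property reports the level that was requested at creation time.
    fast_level = fast.compression_opts
    small_level = small.compression_opts
```

Whether a higher level is worth the extra time depends on the data, so it may be worth measuring both file size and write speed on a representative sample before settling on a value.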
