How Chunking and Compression Can Help You - Python and HDF5


• Integer (1, 2, 4, 8 byte; signed/unsigned) and floating-point (4/8 byte) types only

• Fast compression and decompression

• A decompressor that is almost always available

LZF Compression

For files you'll only be using from Python, LZF is a good choice. It ships with h5py; C source code is available for third-party programs under the BSD license. It's optimized for very, very fast compression at the expense of a lower compression ratio compared to GZIP. It works best when your dataset contains large numbers of redundant data points. There are no compression_opts for this filter.

>>> dset = myfile.create_dataset("Dataset4", (1000,), compression="lzf")

LZF compression:

• Works with all HDF5 types

• Features fast compression and decompression

• Is readily available only from Python (ships with h5py); C source is available for other programs
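As a minimal sketch of the LZF filter in use, the following writes a small array to an LZF-compressed dataset and reads it back. The in-memory file (driver="core" with backing_store=False), the file name, and the sample data are assumptions chosen for illustration; only the compression="lzf" argument comes from the text above.

```python
import h5py
import numpy as np

# Hypothetical in-memory file so the sketch leaves nothing on disk.
with h5py.File("lzf_demo.hdf5", "w", driver="core", backing_store=False) as myfile:
    # Highly redundant sample data (assumed for illustration).
    values = np.zeros(1000, dtype="f4")
    values[::100] = 1.0

    # LZF takes no compression_opts; naming the filter is enough.
    dset = myfile.create_dataset("Dataset4", (1000,), dtype="f4", compression="lzf")
    dset[...] = values

    filter_name = dset.compression   # name of the compression filter in use
    chunk_shape = dset.chunks        # compressed datasets are always chunked
    roundtrip = dset[...]            # decompression is transparent on read
```

Reading the dataset needs no special handling: the filter is recorded in the file, so any read goes through the decompressor automatically.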

Performance

As always, you should run your own performance tests to see what parts of your application would benefit from attention. However, here are some examples to give you an idea of how the various filters stack up. In this experiment (see h5py.org/lzf for details), a 4 MB dataset of single-precision floats was tested against the LZF, GZIP, and SZIP compressors. A 190 KiB chunk size was used.

First, the data elements were assigned their own indices (see Table 4-1):

>>> data[...] = np.arange(1024000)

Table 4-1. Compression of trivial data

Compressor   Compression time (ms)   Decompression time (ms)   Compressed by
None         10.7                    6.5                       0.00%
LZF          18.6                    17.8                      96.66%
GZIP         58.1                    40.5                      98.53%
SZIP         63.1                    61.3                      72.68%
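If you want to run a comparable test yourself, a rough sketch might look like the following. It is not the benchmark behind the table: the in-memory file, the helper name measure, and the chunk shape of 48640 elements (190 KiB divided by 4 bytes per float) are assumptions, and SZIP is left out because it depends on how HDF5 was built. Expect your timings and ratios to differ with your machine, HDF5 build, and filter settings.

```python
import time

import h5py
import numpy as np

N = 1024000
CHUNK = (48640,)   # assumption: 190 KiB / 4 bytes per float32 element

data = np.arange(N, dtype="f4")

def measure(compression):
    """Hypothetical helper: return (write_ms, read_ms, stored_bytes)."""
    with h5py.File("bench.hdf5", "w", driver="core", backing_store=False) as f:
        dset = f.create_dataset("data", (N,), dtype="f4",
                                chunks=CHUNK, compression=compression)
        start = time.perf_counter()
        dset[...] = data
        f.flush()                      # push pending chunks through the filter
        write_ms = (time.perf_counter() - start) * 1000

        start = time.perf_counter()
        out = dset[...]
        read_ms = (time.perf_counter() - start) * 1000

        assert np.array_equal(out, data)
        return write_ms, read_ms, dset.id.get_storage_size()

results = {name: measure(name) for name in (None, "lzf", "gzip")}

# Savings are reported relative to the uncompressed, chunked dataset.
baseline = results[None][2]
for name, (write_ms, read_ms, size) in results.items():
    saved = 100.0 * (1 - size / baseline)
    print(f"{name or 'None':5s} {write_ms:8.1f} {read_ms:8.1f} {saved:6.2f}%")
```

A single run like this is only indicative. Some chunks may still sit in the chunk cache when they are read back, so repeat the measurement and reopen the file between write and read if you need numbers you can rely on.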

Next, a sine wave with added noise was tested (see Table 4-2):

>>> data[...] = np.sin(np.arange(1024000) / 32.) + (np.random.random(1024000) * 0.5 - 0.25)
