Try to express the “natural” access pattern your dataset will have
As in our example, if you are storing a bunch of images in a dataset and know that your application will be reading particular 64×64 "tiles," you could use N×64×64 chunks (or N×128×128) along the image axes; a short sketch after these guidelines shows one way this looks in code.
Don't make them too small
Keep in mind that HDF5 has to use indexing data to keep track of things; if you use
something pathological like a 1-byte chunk size, most of your disk space will be
taken up by metadata. A good rule of thumb for most datasets is to keep chunks above 10 KiB or so.
Don't make them too big
The key thing to remember is that when you read any data in a chunk, the entire
chunk is read. If you only use a subset of the data, the extra time spent reading from
disk is wasted. Keep in mind that chunks bigger than 1 MiB by default will not
participate in the fast, in-memory “chunk cache” and will instead be read from disk
every time.
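To make the first guideline concrete, here is a minimal sketch of tile-oriented chunking, assuming f is an already-open h5py File and using a made-up stack of one hundred 512×512 single-precision images:
>>> # Chunks of shape (1, 64, 64) follow the 64x64 tile access pattern; at 4 bytes
>>> # per element, each chunk is 16 KiB -- above the ~10 KiB floor and well under
>>> # the 1 MiB chunk-cache limit.
>>> dset = f.create_dataset('images', (100, 512, 512), dtype='f4',
...                         chunks=(1, 64, 64))
>>> dset.chunks
(1, 64, 64)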
Performance Example: Resizable Datasets
In the last example of Chapter 3, we discussed some of the performance aspects of resizable datasets. It turns out that with one or two exceptions, HDF5 requires that resizable datasets use chunked storage. This makes sense if you think about how contiguous datasets are stored; expanding any but the first axis would require rewriting the entire dataset!
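You can see this requirement directly: if you pass maxshape when creating a dataset, h5py switches to chunked storage even when you never ask for chunks. A minimal sketch, again assuming an open file f (the dataset names here are made up):
>>> fixed = f.create_dataset('fixed_size', (100, 1000))
>>> fixed.chunks is None       # contiguous storage; no chunking
True
>>> growable = f.create_dataset('growable', (100, 1000), maxshape=(None, 1000))
>>> growable.chunks is None    # chunked storage was selected automatically
False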
There are some chunk-related pitfalls when using resizable datasets, one of which illustrates why you have to be careful about relying on the auto-chunker where performance is critical. It may make decisions that don't match your idea of how the dataset will be used.
Revisiting the example in Chapter 3, let's create two datasets to store a collection of 1000-element-long time traces. The datasets will both be created as expandable along their first axes, and differ only in their initial sizes:
>>> dset1 = f.create_dataset('timetraces1', (1, 1000), maxshape=(None, 1000))
>>> dset2 = f.create_dataset('timetraces2', (5000, 1000), maxshape=(None, 1000))
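Neither call specifies chunks=, so the auto-chunker picks a chunk shape for each dataset, and that choice depends in part on the initial shape. A quick way to see what you ended up with (the exact shapes vary with your h5py version, so no output is shown here):
>>> print(dset1.chunks)   # chosen for the small (1, 1000) starting shape
>>> print(dset2.chunks)   # chosen for the (5000, 1000) starting shape; may differ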
Recall that we had two different approaches to "appending" data to these arrays: simple appending (add_trace_1) and overallocate-and-trim (add_trace_2 and done). The second approach was supposed to be faster, as it involved fewer calls to resize:
def add_trace_1(arr):
    """ Add one trace to the dataset, expanding it as necessary """
    dset1.resize((dset1.shape[0] + 1, 1000))
    dset1[-1, :] = arr
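For reference, the overallocate-and-trim approach might look roughly like the following sketch, reconstructed from the description above (the Chapter 3 version may differ in its details); it writes into the pre-allocated dset2 and keeps a simple module-level counter:
ntraces = 0

def add_trace_2(arr):
    """ Write one trace into the overallocated dataset, counting rows as we go """
    global ntraces
    dset2[ntraces, :] = arr
    ntraces += 1

def done():
    """ After all traces have been added, trim the dataset down to size """
    dset2.resize((ntraces, 1000))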