Try to express the “natural” access pattern your dataset will have
As in our example, if you are storing a bunch of images in a dataset and know that your application will be reading particular 64×64 "tiles," you could use N×64×64 chunks (or N×128×128) along the image axes; a short sketch after these guidelines shows one way this looks in code.
Don't make them too small
Keep in mind that HDF5 has to use indexing data to keep track of things; if you use
something pathological like a 1-byte chunk size, most of your disk space will be
taken up by metadata. A good rule of thumb for most datasets is to keep chunks above 10 KiB or so.
Don't make them too big
The key thing to remember is that when you read any data in a chunk, the entire
chunk is read. If you only use a subset of the data, the extra time spent reading from
disk is wasted. Keep in mind that chunks bigger than 1 MiB by default will not
participate in the fast, in-memory “chunk cache” and will instead be read from disk
every time.
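To make the first guideline concrete, here is a minimal sketch of tile-oriented chunking, assuming f is an already-open h5py File and using a made-up stack of one hundred 512×512 single-precision images:
>>> # Chunks of shape (1, 64, 64) follow the 64x64 tile access pattern; at 4 bytes
>>> # per element, each chunk is 16 KiB -- above the ~10 KiB floor and well under
>>> # the 1 MiB chunk-cache limit.
>>> dset = f.create_dataset('images', (100, 512, 512), dtype='f4',
...                         chunks=(1, 64, 64))
>>> dset.chunks
(1, 64, 64)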
Performance Example: Resizable Datasets
In the last example of Chapter 3, we discussed some of the performance aspects of resizable datasets. It turns out that with one or two exceptions, HDF5 requires that resizable datasets use chunked storage. This makes sense if you think about how contiguous datasets are stored; expanding any but the first axis would require rewriting the entire dataset!
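You can see this requirement directly: if you pass maxshape when creating a dataset, h5py switches to chunked storage even when you never ask for chunks. A minimal sketch, again assuming an open file f (the dataset names here are made up):
>>> fixed = f.create_dataset('fixed_size', (100, 1000))
>>> fixed.chunks is None       # contiguous storage; no chunking
True
>>> growable = f.create_dataset('growable', (100, 1000), maxshape=(None, 1000))
>>> growable.chunks is None    # chunked storage was selected automatically
False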
There are some chunk-related pitfalls when using resizable datasets, one of which illustrates why you have to be careful about relying on the auto-chunker where performance is critical. It may make decisions that don't match your idea of how the dataset will be used.
Revisiting the example in Chapter 3, let's create two datasets to store a collection of 1000-element-long time traces. The datasets will both be created as expandable along their first axes, and differ only in their initial sizes:
>>> dset1 = f.create_dataset('timetraces1', (1, 1000), maxshape=(None, 1000))
>>> dset2 = f.create_dataset('timetraces2', (5000, 1000), maxshape=(None, 1000))
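Neither call specifies chunks=, so the auto-chunker picks a chunk shape for each dataset, and that choice depends in part on the initial shape. A quick way to see what you ended up with (the exact shapes vary with your h5py version, so no output is shown here):
>>> print(dset1.chunks)   # chosen for the small (1, 1000) starting shape
>>> print(dset2.chunks)   # chosen for the (5000, 1000) starting shape; may differ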
Recall that we had two different approaches to "appending" data to these arrays: simple appending (add_trace_1) and overallocate-and-trim (add_trace_2 and done). The second approach was supposed to be faster, as it involved fewer calls to resize:
def add_trace_1(arr):
    """ Add one trace to the dataset, expanding it as necessary """
    dset1.resize((dset1.shape[0] + 1, 1000))
    dset1[-1, :] = arr
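For reference, the overallocate-and-trim approach might look roughly like the following sketch, reconstructed from the description above (the Chapter 3 version may differ in its details); it writes into the pre-allocated dset2 and keeps a simple module-level counter:
ntraces = 0

def add_trace_2(arr):
    """ Write one trace into the overallocated dataset, counting rows as we go """
    global ntraces
    dset2[ntraces, :] = arr
    ntraces += 1

def done():
    """ After all traces have been added, trim the dataset down to size """
    dset2.resize((ntraces, 1000))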