Setting the Chunk Shape
You are certainly free to pick your own chunk shape, although be sure to read about the performance implications later in this chapter. Most of the time, chunking will be automatically enabled by using features like compression or marking a dataset as resizable. In that case, the auto-chunker in h5py will help you pick a chunk size.
Auto-Chunking
If you don't want to sit down and figure out a chunk shape, you can have h5py try to
guess one for you by setting chunks to True instead of a tuple:
>>> dset = f.create_dataset("Images2", (100, 480, 640), 'f', chunks=True)
>>> dset . chunks
(13, 60, 80)
The “auto-chunker” tries to keep chunks mostly “square” (in N dimensions) and within
certain size limits. It's also invoked when you specify the use of compression or other
filters without explicitly providing a chunk shape.
By the way, the reason the automatically generated chunks are “square” in N dimensions is that the auto-chunker has no idea what you're planning to do with the dataset, and is hedging its bets. It's ideal for people who just want to compress a dataset and don't want to bother with the details, but less ideal for those with specific time-critical access patterns.
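As a quick sketch of the auto-chunker being invoked implicitly, the following creates a compressed dataset without ever passing chunks= (the file is kept in memory via the core driver with backing_store=False purely so the example is self-contained; the filename is arbitrary):

```python
import h5py

# In-memory file: the "core" driver with backing_store=False never
# touches disk, so this sketch leaves no file behind.
f = h5py.File("example.h5", "w", driver="core", backing_store=False)

# No chunks= argument here, but gzip compression requires chunked
# storage, so h5py's auto-chunker silently picks a shape for us.
dset = f.create_dataset("Compressed", (100, 480, 640), dtype="f",
                        compression="gzip")

chunks = dset.chunks   # a tuple chosen by the auto-chunker, e.g. (13, 60, 80)
print(chunks)
```

Note that dset.chunks is non-None even though we never asked for chunking: specifying a filter is enough.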
Manually Picking a Shape
Here are some things to keep in mind when working with chunks. Picking a chunk shape is a trade-off among the following three constraints:
1. Larger chunks for a given dataset size reduce the size of the chunk B-tree, making
it faster to find and load chunks.
2. Since chunks are all or nothing (reading a portion loads the entire chunk), larger
chunks also increase the chance that you'll read data into memory you won't use.
3. The HDF5 chunk cache can only hold a finite number of chunks. Chunks bigger
than 1 MiB don't even participate in the cache.
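Constraint 3 is easy to check with plain arithmetic: multiply the chunk dimensions by the element size and compare against 1 MiB. A minimal sketch (the helper name chunk_nbytes is made up for illustration; 4 bytes is the item size of the 'f' single-precision float type used above):

```python
from math import prod

def chunk_nbytes(shape, itemsize=4):
    """Bytes occupied by one chunk of the given shape.

    itemsize=4 corresponds to the 'f' (float32) dtype used in
    this chapter's examples.
    """
    return prod(shape) * itemsize

# The auto-chunked shape from earlier fits comfortably in the cache:
print(chunk_nbytes((13, 60, 80)))          # 249600 bytes, well under 1 MiB

# One giant chunk covering the whole dataset would not be cached at all:
print(chunk_nbytes((100, 480, 640)) > 2**20)   # True
```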
So here are the main points to keep in mind:
Do you even need to specify a chunk size?
It's best to restrict manual chunk sizing to cases when you know for sure your dataset
will be accessed in a way that's likely to be inefficient with either contiguous storage
or an auto-guessed chunk shape. And like all optimizations, you should benchmark!
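For instance, if you know for sure that the dataset will always be read one whole 480×640 image at a time, a chunk of exactly one image means every read touches exactly one chunk. A sketch of that case (again using an in-memory file so the example is self-contained; the dataset name is arbitrary):

```python
import h5py

f = h5py.File("images.h5", "w", driver="core", backing_store=False)

# Known access pattern: whole-image reads along the first axis.
# One chunk per image avoids loading data we won't use.
dset = f.create_dataset("Images", (100, 480, 640), dtype="f",
                        chunks=(1, 480, 640))

chunks = dset.chunks
print(chunks)   # (1, 480, 640)
```

Whether this actually beats the auto-guessed shape for your workload is exactly the kind of thing you should benchmark.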