Figure 4-1. Contiguous storage on disk, accessing (A) an entire image all at once, and
(B) a 64×64 image tile. Gray regions are the data being retrieved.
It's easy to see that applications reading a whole image, or a series of whole images,
will read the data efficiently. The advantage of contiguous storage is that the layout
on disk corresponds directly to the shape of the dataset: stepping through the last index
always means moving through the data in order on disk.
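This row-major ("C order") layout can be sketched with a little offset arithmetic. The code below is illustrative only; the dataset shape (100, 480, 640) and the helper `flat_offset` are assumptions for this sketch, not part of HDF5's API:

```python
# Hypothetical dataset of 100 images, each 480x640, stored contiguously
# in C (row-major) order.
count, height, width = 100, 480, 640

def flat_offset(i, j, k):
    # Position of element (i, j, k) in the linear stream on disk.
    return (i * height + j) * width + k

# Stepping through the last index touches consecutive positions on disk:
assert flat_offset(0, 0, 0) == 0
assert flat_offset(0, 0, 1) == 1        # next pixel is the next element
assert flat_offset(0, 1, 0) == width    # next row follows immediately
```

In other words, whole-image reads are one long sequential sweep through the file.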
But what if, instead of processing whole images one after another, our application deals
with image tiles? Let's say we want to read and process the data in a 64×64 pixel slice in
the corner of the first image; for example, say we want to add a logo.
Our slicing selection would be:
>>> tile = dset[0, 0:64, 0:64]
>>> tile.shape
(64, 64)
Figure 4-1 (B) shows how the data is read in this case. Not so good. Instead of reading
one nice contiguous block of data, our application has to gather data from all over the
place. If we wanted the 64×64 tile from every image at once (dset[:,0:64,0:64]), we'd
have to read all the way to the end of the dataset!
The fundamental problem here is that the default contiguous storage mechanism does
not match our access pattern.
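The mismatch can be quantified with the same offset arithmetic. Again, the (100, 480, 640) shape is an assumption for the sketch:

```python
# Cost of tile access under contiguous C-order storage, for a hypothetical
# dataset of shape (100, 480, 640).
count, height, width = 100, 480, 640

# Each of the tile's 64 rows is a separate contiguous run on disk, because
# consecutive rows sit `width` elements apart in the linear stream.
runs_one_image = 64
runs_all_images = 64 * count   # dset[:,0:64,0:64] -> 6400 scattered reads

# The file offsets touched span nearly the entire dataset, even though we
# only want a small corner of each image:
first = (0 * height + 0) * width + 0
last = ((count - 1) * height + 63) * width + 63
assert last > 0.99 * (count * height * width)
```

Thousands of small, scattered reads over almost the whole file is exactly the access pattern rotating disks (and even SSDs) handle worst.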
Chunked Storage
What if there were some way to express this in advance? Isn't there a way to preserve
the shape of the dataset, which is semantically important, but tell HDF5 to optimize the
dataset for access in 64×64 pixel blocks?
That's what chunking does in HDF5. It lets you specify the N-dimensional “shape” that
best fits your access pattern. When the time comes to write data to disk, HDF5 splits
the data into “chunks” of the specified shape, flattens them, and writes them to disk. The
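In h5py, chunking is requested with the `chunks` keyword of `create_dataset`. The file name, dataset name, shape, and dtype below are assumptions chosen to match the running example:

```python
# Sketch: creating a dataset chunked in 64x64 tiles with h5py.
import h5py

with h5py.File("chunked_demo.h5", "w") as f:
    dset = f.create_dataset(
        "images",
        shape=(100, 480, 640),
        dtype="u1",
        chunks=(1, 64, 64),   # one 64x64 tile of one image per chunk
    )
    print(dset.chunks)        # the chunk shape HDF5 will use on disk

    # A corner tile now maps onto a single stored chunk rather than
    # 64 scattered row fragments:
    tile = dset[0, 0:64, 0:64]
    print(tile.shape)
```

Reading `dset[:,0:64,0:64]` now fetches one whole chunk per image instead of 64 scattered row fragments per image.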