Figure 4-1. Contiguous storage on disk, accessing (A) an entire image all at once, and
(B) a 64×64 image tile. Gray regions are the data being retrieved.
It's easy to see that applications reading a whole image, or a series of whole images,
will read the data efficiently. The advantage of contiguous storage is that the layout
on disk corresponds directly to the shape of the dataset: stepping through the last index
always means moving through the data in order on disk.
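This row-major ("C order") layout can be sketched with a little offset arithmetic. The code below is illustrative only; the dataset shape (100, 480, 640) and the helper `flat_offset` are assumptions for this sketch, not part of HDF5's API:

```python
# Hypothetical dataset of 100 images, each 480x640, stored contiguously
# in C (row-major) order.
count, height, width = 100, 480, 640

def flat_offset(i, j, k):
    # Position of element (i, j, k) in the linear stream on disk.
    return (i * height + j) * width + k

# Stepping through the last index touches consecutive positions on disk:
assert flat_offset(0, 0, 0) == 0
assert flat_offset(0, 0, 1) == 1        # next pixel is the next element
assert flat_offset(0, 1, 0) == width    # next row follows immediately
```

In other words, whole-image reads are one long sequential sweep through the file.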
But what if, instead of processing whole images one after another, our application deals
with image tiles? Let's say we want to read and process the data in a 64×64 pixel slice in
the corner of the first image; for example, say we want to add a logo.
Our slicing selection would be:
>>> tile = dset[0, 0:64, 0:64]
>>> tile.shape
(64, 64)
Figure 4-1 (B) shows how the data is read in this case. Not so good. Instead of reading
one nice contiguous block of data, our application has to gather data from all over the
place. If we wanted the 64×64 tile from every image at once (dset[:,0:64,0:64]), we'd
have to read all the way to the end of the dataset!
The fundamental problem here is that the default contiguous storage mechanism does
not match our access pattern.
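The mismatch can be quantified with the same offset arithmetic. Again, the (100, 480, 640) shape is an assumption for the sketch:

```python
# Cost of tile access under contiguous C-order storage, for a hypothetical
# dataset of shape (100, 480, 640).
count, height, width = 100, 480, 640

# Each of the tile's 64 rows is a separate contiguous run on disk, because
# consecutive rows sit `width` elements apart in the linear stream.
runs_one_image = 64
runs_all_images = 64 * count   # dset[:,0:64,0:64] -> 6400 scattered reads

# The file offsets touched span nearly the entire dataset, even though we
# only want a small corner of each image:
first = (0 * height + 0) * width + 0
last = ((count - 1) * height + 63) * width + 63
assert last > 0.99 * (count * height * width)
```

Thousands of small, scattered reads over almost the whole file is exactly the access pattern rotating disks (and even SSDs) handle worst.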
Chunked Storage
What if there were some way to express this in advance? Isn't there a way to preserve
the shape of the dataset, which is semantically important, but tell HDF5 to optimize the
dataset for access in 64×64 pixel blocks?
That's what chunking does in HDF5. It lets you specify the N-dimensional “shape” that
best fits your access pattern. When the time comes to write data to disk, HDF5 splits
the data into “chunks” of the specified shape, flattens them, and writes them to disk. The
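In h5py, chunking is requested with the `chunks` keyword of `create_dataset`. The file name, dataset name, shape, and dtype below are assumptions chosen to match the running example:

```python
# Sketch: creating a dataset chunked in 64x64 tiles with h5py.
import h5py

with h5py.File("chunked_demo.h5", "w") as f:
    dset = f.create_dataset(
        "images",
        shape=(100, 480, 640),
        dtype="u1",
        chunks=(1, 64, 64),   # one 64x64 tile of one image per chunk
    )
    print(dset.chunks)        # the chunk shape HDF5 will use on disk

    # A corner tile now maps onto a single stored chunk rather than
    # 64 scattered row fragments:
    tile = dset[0, 0:64, 0:64]
    print(tile.shape)
```

Reading `dset[:,0:64,0:64]` now fetches one whole chunk per image instead of 64 scattered row fragments per image.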