How Chunking and Compression Can Help You - Python and HDF5

Databases Reference

In-Depth Information

CHAPTER 4

How Chunking and Compression

Can Help You

So far we have avoided talking about exactly how the data you write is stored on disk.

Some of the most interesting features in HDF5, including per-dataset compression, are

tied up in the details of how data is arranged on disk.

Before we get down to the nuts and bolts, there's a more fundamental issue we have to

discuss: how multidimensional arrays are actually handled in Python and HDF5.

Contiguous Storage

Let's suppose we have a four-element NumPy array of strings:

>>> a = np . array ([ [ "A" , "B" ], [ "C" , "D" ] ])

>>> print a

[['A' 'B']

['C' 'D']]

Mathematically, this is a two-dimensional object. It has two axes, and can be indexed

using a pair of numbers in the range 0 to 1:

>>> a [ 1 , 1 ]

'D'

However, there's no such thing as “two-dimensional” computer memory, at least not in

common use. The elements are actually stored in a one-dimensional buffer:

'A' 'B' 'C' 'D'

This is called contiguous storage, because all the elements of the array, whether it's stored

on disk or in memory, are stored one after another. NumPy uses a simple set of rules to

turn an indexing expression into the appropriate offset into this one-dimensional buffer.

Search WWH ::

Custom Search

Home