guages like C and FORTRAN. It's now relatively common to deal with datasets hundreds
of gigabytes or even terabytes in size; HDF5 itself can scale up to exabytes.
On all but the biggest machines, it's not feasible to load such datasets directly into
memory. One of HDF5's greatest strengths is its support for subsetting and partial I/O.
For example, let's take the 1024-element “temperature” dataset we created earlier:
>>> dataset = f["/15/temperature"]
Here, the object named dataset is a proxy object representing an HDF5 dataset. It supports array-like slicing operations, which will be familiar to frequent NumPy users:
>>> dataset[0:10]
array([ 0.44149738,  0.7407523 ,  0.44243584,  0.3100173 ,  0.04552416,
        0.43933469,  0.28550775,  0.76152561,  0.79451732,  0.32603454])
>>> dataset[0:10:2]
array([ 0.44149738,  0.44243584,  0.04552416,  0.28550775,  0.79451732])
Keep in mind that the actual data lives on disk; when slicing is applied to an HDF5
dataset, the appropriate data is found and loaded into memory. Slicing in this fashion
leverages the underlying subsetting capabilities of HDF5 and is consequently very fast.
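To make the slicing behavior concrete, here's a minimal, self-contained sketch; the filename is an illustrative stand-in for the file opened earlier:

```python
import numpy as np
import h5py

# Build a demo file containing a 1024-element dataset (names are illustrative).
with h5py.File("weather_demo.hdf5", "w") as f:
    f["/15/temperature"] = np.random.rand(1024)

with h5py.File("weather_demo.hdf5", "r") as f:
    dataset = f["/15/temperature"]   # proxy object; nothing is read yet
    first_ten = dataset[0:10]        # only these 10 values are read from disk
    every_other = dataset[0:10:2]    # strided slices map to HDF5 selections too

print(first_ten.shape, every_other.shape)   # (10,) (5,)
```

Because the slice is translated into an HDF5 hyperslab selection, only the requested elements are transferred, no matter how large the dataset is.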
Another great thing about HDF5 is that you have control over how storage is allocated.
For example, except for some metadata, a brand-new dataset takes zero space, and by default bytes are only used on disk to hold the data you actually write.
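One way to see this is to flush a brand-new dataset to disk and compare the file size against the dataset's nominal size. A sketch, assuming the default (late) allocation behavior; the 8 MiB shape is arbitrary:

```python
import os
import numpy as np
import h5py

with h5py.File("alloc_demo.hdf5", "w") as f:
    # Nominally 1024 * 1024 float64 values = 8 MiB of data.
    dset = f.create_dataset("empty", shape=(1024, 1024), dtype='float64')
    f.flush()
    size_on_disk = os.path.getsize("alloc_demo.hdf5")

# Only metadata has been written so far; far less than the nominal 8 MiB.
print(size_on_disk < 8 * 1024 * 1024)   # True
```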
For example, here's a 2-terabyte dataset you can create on just about any computer:
>>> big_dataset = f.create_dataset("big", shape=(1024, 1024, 1024, 512), dtype='float32')
Although no storage is yet allocated, the entire “space” of the dataset is available to us.
We can write anywhere in the dataset, and only the bytes on disk necessary to hold the
data are used:
>>> big_dataset[344, 678, 23, 36] = 42.0
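Unwritten regions simply read back as the dataset's fill value (0.0 by default), which is one way to check a single write without touching the rest of the space. A sketch using a much smaller shape so it runs anywhere:

```python
import h5py

with h5py.File("sparse_demo.hdf5", "w") as f:
    # A small stand-in for the 2-terabyte dataset; same behavior, tiny shape.
    demo = f.create_dataset("big", shape=(64, 64, 64, 8), dtype='float32')
    demo[34, 6, 23, 3] = 42.0       # write a single element
    written = demo[34, 6, 23, 3]    # the value we stored
    untouched = demo[0, 0, 0, 0]    # default fill value: 0.0

print(written, untouched)   # 42.0 0.0
```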
When storage is at a premium, you can even use transparent compression on a dataset-by-dataset basis (see Chapter 4):
>>> compressed_dataset = f.create_dataset("comp", shape=(1024,), dtype='int32', compression='gzip')
>>> compressed_dataset[:] = np.arange(1024)
>>> compressed_dataset[:]
array([   0,    1,    2, ..., 1021, 1022, 1023])
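Compression settings are visible as properties of the dataset, and gzip-compressed datasets are automatically stored in HDF5's chunked layout; decompression on read is transparent. A quick sketch:

```python
import numpy as np
import h5py

with h5py.File("comp_demo.hdf5", "w") as f:
    comp = f.create_dataset("comp", shape=(1024,), dtype='int32',
                            compression='gzip')
    comp[:] = np.arange(1024)
    print(comp.compression)          # gzip
    print(comp.chunks is not None)   # True: compression implies chunking
    data = comp[:]                   # reads decompress transparently

print(data[0], data[-1])   # 0 1023
```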
What Exactly Is HDF5?
HDF5 is a great mechanism for storing large numerical arrays of homogeneous type, for data models that can be organized hierarchically and benefit from the tagging of datasets with arbitrary metadata.