…languages like C and FORTRAN. It's now relatively common to deal with datasets hundreds of gigabytes or even terabytes in size; HDF5 itself can scale up to exabytes.
On all but the biggest machines, it's not feasible to load such datasets directly into
memory. One of HDF5's greatest strengths is its support for subsetting and partial I/O.
For example, let's take the 1024-element “temperature” dataset we created earlier:
>>> dataset = f["/15/temperature"]
Here, the object named dataset is a proxy object representing an HDF5 dataset. It
supports array-like slicing operations, which will be familiar to frequent NumPy users:
>>> dataset[0:10]
array([ 0.44149738,  0.7407523 ,  0.44243584,  0.3100173 ,  0.04552416,
        0.43933469,  0.28550775,  0.76152561,  0.79451732,  0.32603454])
>>> dataset[0:10:2]
array([ 0.44149738,  0.44243584,  0.04552416,  0.28550775,  0.79451732])
Keep in mind that the actual data lives on disk; when slicing is applied to an HDF5
dataset, the appropriate data is found and loaded into memory. Slicing in this fashion
leverages the underlying subsetting capabilities of HDF5 and is consequently very fast.
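A slice like this comes back as an ordinary NumPy array holding only the requested elements. As a minimal sketch (assuming the same open file f, and NumPy imported as np as in the later examples), you can also read a selection straight into an existing array with the dataset's read_direct method:
>>> subset = dataset[0:10]                   # only these ten values are read from disk
>>> out = np.zeros((10,), dtype=dataset.dtype)
>>> dataset.read_direct(out, np.s_[0:10])    # read the same selection into an existing array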
Another great thing about HDF5 is that you have control over how storage is allocated.
For example, except for some metadata, a brand new dataset takes zero space, and by
default bytes are only used on disk to hold the data you actually write.
For example, here's a 2-terabyte dataset you can create on just about any computer:
>>> big_dataset = f.create_dataset("big", shape=(1024, 1024, 1024, 512),
...                                dtype='float32')
Although no storage is yet allocated, the entire “space” of the dataset is available to us.
We can write anywhere in the dataset, and only the bytes on disk necessary to hold the
data are used:
>>> big_dataset[344, 678, 23, 36] = 42.0
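As a quick check (a sketch assuming the write above succeeded), the written element reads back as 42.0, while elements that were never written read back as the dataset's fill value, which defaults to zero:
>>> big_dataset[344, 678, 23, 36]
42.0
>>> big_dataset[0, 0, 0, 0]        # never written; returns the default fill value
0.0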
When storage is at a premium, you can even use transparent compression on a dataset-by-dataset basis (see Chapter 4):
>>> compressed_dataset = f.create_dataset("comp", shape=(1024,), dtype='int32',
...                                        compression='gzip')
>>> compressed_dataset[:] = np.arange(1024)
>>> compressed_dataset[:]
array([   0,    1,    2, ..., 1021, 1022, 1023])
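The compression settings become properties of the dataset, and gzip also accepts an effort level from 0 to 9 via the compression_opts keyword. A brief sketch using only these documented h5py options (the dataset name "comp9" is just for illustration):
>>> compressed_dataset.compression
'gzip'
>>> heavy = f.create_dataset("comp9", shape=(1024,), dtype='int32',
...                          compression='gzip', compression_opts=9)   # slowest, smallest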
What Exactly Is HDF5?
HDF5 is a great mechanism for storing large numerical arrays of homogeneous type, for data models that can be organized hierarchically and benefit from tagging of datasets with arbitrary metadata.
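Both ideas are already visible in miniature above: groups give the file its hierarchical, filesystem-like layout, and attributes let you tag a dataset with whatever metadata you like. A small sketch, where the group name "/16" and the attribute keys are purely illustrative:
>>> grp = f.create_group("/16")                      # another "station", alongside /15
>>> ds = grp.create_dataset("temperature", shape=(1024,), dtype='float64')
>>> ds.attrs['units'] = 'kelvin'                     # arbitrary metadata attached to the dataset
>>> ds.attrs['sampling_interval'] = 10.0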