guages like C and FORTRAN. It's now relatively common to deal with datasets hundreds
of gigabytes or even terabytes in size; HDF5 itself can scale up to exabytes.
On all but the biggest machines, it's not feasible to load such datasets directly into
memory. One of HDF5's greatest strengths is its support for subsetting and partial I/O.
For example, let's take the 1024-element “temperature” dataset we created earlier:
>>> dataset = f["/15/temperature"]
Here, the object named dataset is a proxy object representing an HDF5 dataset. It supports array-like slicing operations, which will be familiar to frequent NumPy users:
>>> dataset[0:10]
array([ 0.44149738,  0.7407523 ,  0.44243584,  0.3100173 ,  0.04552416,
        0.43933469,  0.28550775,  0.76152561,  0.79451732,  0.32603454])
>>> dataset[0:10:2]
array([ 0.44149738,  0.44243584,  0.04552416,  0.28550775,  0.79451732])
Keep in mind that the actual data lives on disk; when slicing is applied to an HDF5
dataset, the appropriate data is found and loaded into memory. Slicing in this fashion
leverages the underlying subsetting capabilities of HDF5 and is consequently very fast.
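To make the slicing behavior concrete, here's a minimal, self-contained sketch; the filename is an illustrative stand-in for the file opened earlier:

```python
import numpy as np
import h5py

# Build a demo file containing a 1024-element dataset (names are illustrative).
with h5py.File("weather_demo.hdf5", "w") as f:
    f["/15/temperature"] = np.random.rand(1024)

with h5py.File("weather_demo.hdf5", "r") as f:
    dataset = f["/15/temperature"]   # proxy object; nothing is read yet
    first_ten = dataset[0:10]        # only these 10 values are read from disk
    every_other = dataset[0:10:2]    # strided slices map to HDF5 selections too

print(first_ten.shape, every_other.shape)   # (10,) (5,)
```

Because the slice is translated into an HDF5 hyperslab selection, only the requested elements are transferred, no matter how large the dataset is.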
Another great thing about HDF5 is that you have control over how storage is allocated.
For example, except for some metadata, a brand-new dataset takes zero space, and by default bytes are only used on disk to hold the data you actually write.
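One way to see this is to flush a brand-new dataset to disk and compare the file size against the dataset's nominal size. A sketch, assuming the default (late) allocation behavior; the 8 MiB shape is arbitrary:

```python
import os
import numpy as np
import h5py

with h5py.File("alloc_demo.hdf5", "w") as f:
    # Nominally 1024 * 1024 float64 values = 8 MiB of data.
    dset = f.create_dataset("empty", shape=(1024, 1024), dtype='float64')
    f.flush()
    size_on_disk = os.path.getsize("alloc_demo.hdf5")

# Only metadata has been written so far; far less than the nominal 8 MiB.
print(size_on_disk < 8 * 1024 * 1024)   # True
```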
For example, here's a 2-terabyte dataset you can create on just about any computer:
>>> big_dataset = f.create_dataset("big", shape=(1024, 1024, 1024, 512), dtype='float32')
Although no storage is yet allocated, the entire “space” of the dataset is available to us.
We can write anywhere in the dataset, and only the bytes on disk necessary to hold the
data are used:
>>> big_dataset[344, 678, 23, 36] = 42.0
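Unwritten regions simply read back as the dataset's fill value (0.0 by default), which is one way to check a single write without touching the rest of the space. A sketch using a much smaller shape so it runs anywhere:

```python
import h5py

with h5py.File("sparse_demo.hdf5", "w") as f:
    # A small stand-in for the 2-terabyte dataset; same behavior, tiny shape.
    demo = f.create_dataset("big", shape=(64, 64, 64, 8), dtype='float32')
    demo[34, 6, 23, 3] = 42.0       # write a single element
    written = demo[34, 6, 23, 3]    # the value we stored
    untouched = demo[0, 0, 0, 0]    # default fill value: 0.0

print(written, untouched)   # 42.0 0.0
```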
When storage is at a premium, you can even use transparent compression on a dataset-by-dataset basis (see Chapter 4):
>>> compressed_dataset = f.create_dataset("comp", shape=(1024,), dtype='int32', compression='gzip')
>>> compressed_dataset[:] = np.arange(1024)
>>> compressed_dataset[:]
array([   0,    1,    2, ..., 1021, 1022, 1023])
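Compression settings are visible as properties of the dataset, and gzip-compressed datasets are automatically stored in HDF5's chunked layout; decompression on read is transparent. A quick sketch:

```python
import numpy as np
import h5py

with h5py.File("comp_demo.hdf5", "w") as f:
    comp = f.create_dataset("comp", shape=(1024,), dtype='int32',
                            compression='gzip')
    comp[:] = np.arange(1024)
    print(comp.compression)          # gzip
    print(comp.chunks is not None)   # True: compression implies chunking
    data = comp[:]                   # reads decompress transparently

print(data[0], data[-1])   # 0 1023
```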
What Exactly Is HDF5?
HDF5 is a great mechanism for storing large numerical arrays of homogeneous type, for data models that can be organized hierarchically and benefit from the tagging of datasets with arbitrary metadata.