>>> out = np.load("weather.npz")
>>> out["data"]
array([ 0.44149738, 0.7407523 , 0.44243584, ..., 0.19018119,
        0.64844851, 0.55660748])
>>> out["start_time"]
array(1375204299)
>>> out["station"]
array(15)
So far so good. But what if we have more than one quantity per station? Say there's also
wind speed data to record?
>>> wind = np.random.random(2048)
>>> dt_wind = 5.0  # Wind sampled every 5 seconds
And suppose we have multiple stations. We could introduce some kind of naming
convention, I suppose: “wind_15” for the wind values from station 15, and things like
“dt_wind_15” for the sampling interval. Or we could use multiple files…
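To make the cost of such a naming convention concrete, here is a minimal sketch of what it might look like with NumPy alone. The station numbers, the random placeholder data, and the key scheme ("<quantity>_<station>" and "dt_<quantity>_<station>") are assumptions for illustration, and an in-memory buffer stands in for a .npz file on disk.

```python
import io
import numpy as np

# Hypothetical measurements for two stations (random placeholder data).
rng = np.random.default_rng(0)
stations = {
    15: {"temperature": rng.random(1024), "wind": rng.random(2048)},
    20: {"temperature": rng.random(1024), "wind": rng.random(2048)},
}
dt = {"temperature": 10.0, "wind": 5.0}  # sampling intervals in seconds

# Flatten everything into "<quantity>_<station>" and
# "dt_<quantity>_<station>" keys, since .npz has no hierarchy.
arrays = {}
for station, quantities in stations.items():
    for name, values in quantities.items():
        arrays[f"{name}_{station}"] = values
        arrays[f"dt_{name}_{station}"] = dt[name]

buffer = io.BytesIO()  # in-memory stand-in for a .npz file on disk
np.savez(buffer, **arrays)
buffer.seek(0)

out = np.load(buffer)
# A reader has to know the convention in advance to find anything.
wind_15 = out["wind_15"]
dt_wind_15 = float(out["dt_wind_15"])
keys = sorted(out.files)
print(keys)
```

Every new quantity or station adds more flat keys, and nothing in the file itself records that "dt_wind_15" describes "wind_15"; that link exists only in the naming scheme.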
In contrast, here's how this application might approach storage with HDF5:
>>> import h5py
>>> f = h5py.File("weather.hdf5", "a")  # "a": read/write, create if missing
>>> f["/15/temperature"] = temperature
>>> f["/15/temperature"].attrs["dt"] = 10.0
>>> f["/15/temperature"].attrs["start_time"] = 1375204299
>>> f["/15/wind"] = wind
>>> f["/15/wind"].attrs["dt"] = 5.0
---
>>> f["/20/temperature"] = temperature_from_station_20
---
(and so on)
This example illustrates two of the “killer features” of HDF5: organization in hierarchical
groups and attributes. Groups, like folders in a filesystem, let you store related datasets
together. In this case, temperature and wind measurements from the same weather
station are stored together under groups named “/15,” “/20,” etc. Attributes let you attach
descriptive metadata directly to the data it describes. So if you give this file to a colleague,
she can easily discover the information needed to make sense of the data:
>>> dataset = f["/15/temperature"]
>>> for key, value in dataset.attrs.items():
...     print("%s: %s" % (key, value))
dt: 10.0
start_time: 1375204299
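The same kind of discovery can be extended to a whole file. The following is a rough sketch, not part of the original example: it builds a small in-memory file with the layout used above (random placeholder data, a made-up file name) and then walks it with h5py's visititems, collecting the attributes of every dataset it finds. The helper name describe and the dictionary found are hypothetical.

```python
import numpy as np
import h5py

# In-memory HDF5 file; nothing is written to disk. The name is a placeholder.
f = h5py.File("weather_sketch.hdf5", "w", driver="core", backing_store=False)

# Recreate the layout from the text, using random placeholder data.
f["/15/temperature"] = np.random.random(1024)
f["/15/temperature"].attrs["dt"] = 10.0
f["/15/temperature"].attrs["start_time"] = 1375204299
f["/15/wind"] = np.random.random(2048)
f["/15/wind"].attrs["dt"] = 5.0

found = {}

def describe(name, obj):
    """Record and print the attributes of every dataset encountered."""
    if isinstance(obj, h5py.Dataset):
        found[name] = dict(obj.attrs)
        print(name, obj.shape, found[name])

# visititems calls describe(name, object) for each group and dataset
# below the root; returning None lets the traversal continue.
f.visititems(describe)
f.close()
```

Because the metadata travels with the datasets, a colleague needs no prior knowledge of key names: the group structure and attributes can be read directly from the file.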
Coping with Large Data Volumes
As a high-level “glue” language, Python is increasingly being used for rapid visualization
of big datasets and to coordinate large-scale computations that run in compiled lan‐