Python and HDF5
In the Python world, consensus is rapidly converging on Hierarchical Data Format
version 5, or “HDF5,” as the standard mechanism for storing large quantities of
numerical data. As data volumes get larger, organization of data becomes increasingly
important; features in HDF5 like named datasets (Chapter 3), hierarchically organized
groups (Chapter 5), and user-defined metadata “attributes” (Chapter 6) become
essential to the analysis process.
Structured, “self-describing” formats like HDF5 are a natural complement to Python.
Two production-ready, feature-rich interface packages exist for HDF5: h5py and
PyTables. There are also a number of smaller special-purpose wrappers.
Organizing Data and Metadata
Here's a simple example of how HDF5's structuring capability can help an application.
Don't worry too much about the details; later chapters explain both the details of how
the file is structured, and how to use the HDF5 API from Python. Consider this a taste
of what HDF5 can do for your application. If you want to follow along, you'll need
Python 2 with NumPy installed (see Chapter 2).
Suppose we have a NumPy array that represents some data from an experiment:
>>> import numpy as np
>>> temperature = np.random.random(1024)
>>> temperature
array([ 0.44149738, 0.7407523 , 0.44243584, ..., 0.19018119,
        0.64844851, 0.55660748])
Let's also imagine that these data points were recorded from a weather station that
sampled the temperature, say, every 10 seconds. In order to make sense of the data, we
have to record that sampling interval, or “delta-T,” somewhere. For now we'll put it in
a Python variable:
>>> dt = 10.0
The data acquisition started at a particular time, which we will also need to record. And
of course, we have to know that the data came from Weather Station 15:
>>> start_time = 1375204299 # in Unix time
>>> station = 15
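To see why the sampling interval and start time matter, here is a minimal sketch of how the two values let us recover the time of every sample. The name timestamps is our own and is not part of the original example; the code is written as a standalone script rather than an interactive session.

```python
import numpy as np

# Same values as in the running example
temperature = np.random.random(1024)
dt = 10.0                  # sampling interval in seconds
start_time = 1375204299    # in Unix time

# Sample number i was recorded at start_time + i * dt.
# "timestamps" is an illustrative name, not part of the original text.
timestamps = start_time + dt * np.arange(len(temperature))

print(timestamps[0])    # time of the first sample
print(timestamps[-1])   # time of the last sample
```

Without dt and start_time, the temperature array is only a list of numbers with no way to place them in time.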
We could use the built-in NumPy function np.savez to store these values on disk. This
simple function saves the values as NumPy arrays, packed together in a ZIP file with
associated names:
>>> np.savez("weather.npz", data=temperature, start_time=start_time, station=station)
We can get the values back from the file with np.load.
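As a rough sketch of the full round trip, the script below saves the three values and reads them back. It writes to a temporary directory instead of the current one, which is our own choice and not part of the original example. The names path, archive, names, data, loaded_start, and loaded_station are also illustrative.

```python
import os
import tempfile

import numpy as np

# Same values as in the running example
temperature = np.random.random(1024)
start_time = 1375204299    # in Unix time
station = 15

# Assumption: save into a temporary directory to keep the working directory clean
path = os.path.join(tempfile.mkdtemp(), "weather.npz")
np.savez(path, data=temperature, start_time=start_time, station=station)

# np.load returns a dictionary-like object keyed by the names given to np.savez
with np.load(path) as archive:
    names = sorted(archive.files)
    data = archive["data"]
    loaded_start = archive["start_time"]
    loaded_station = archive["station"]

print(names)
print(data.shape)
print(int(loaded_start))
print(int(loaded_station))
```

Note that the scalar values come back as zero-dimensional NumPy arrays, since np.savez stores everything as arrays; int() converts them back to plain Python integers.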