Introduction - Python and HDF5

It's quite different from SQL-style relational databases. HDF5 has quite a few

organizational tricks up its sleeve (see Chapter 8, for example), but if you find yourself needing

to enforce relationships between values in various tables, or wanting to perform JOINs

on your data, a relational database is probably more appropriate. Likewise, for tiny 1D

datasets you need to be able to read on machines without HDF5 installed, text formats

like CSV (with all their warts) are a reasonable alternative.

HDF5 is just about perfect if you make minimal use of relational features and have a

need for very high performance, partial I/O, hierarchical organization, and arbitrary

metadata.

So what, specifically, is “HDF5”? I would argue it consists of three things:

1. A file specification and associated data model.

2. A standard library with API access available from C, C++, Java, Python, and others.

3. A software ecosystem, consisting of both client programs using HDF5 and “analysis

platforms” like MATLAB, IDL, and Python.

HDF5: The File

In the preceding brief examples, you saw the three main elements of the HDF5 data

model: datasets, array-like objects that store your numerical data on disk; groups,

hierarchical containers that store datasets and other groups; and attributes, user-defined

bits of metadata that can be attached to datasets (and groups!).
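
The three elements can be sketched in a few lines of h5py. This is only a minimal illustration, not code from the preceding examples: the file name, the group name "experiment", the dataset name "readings", and both attributes are hypothetical, and the file lives in memory (the "core" driver with backing_store=False) so nothing is written to disk.

```python
import numpy as np
import h5py

# Hypothetical in-memory file; driver="core" with backing_store=False
# keeps everything in RAM and writes nothing to disk.
with h5py.File("elements_demo.hdf5", "w", driver="core", backing_store=False) as f:
    # Group: a hierarchical container for datasets and other groups.
    grp = f.create_group("experiment")

    # Dataset: an array-like object holding the numerical data.
    dset = grp.create_dataset("readings", data=np.arange(10, dtype="f8"))

    # Attributes: small pieces of metadata, attachable to datasets and groups.
    dset.attrs["units"] = "volts"               # hypothetical attribute
    grp.attrs["operator"] = "hypothetical user"  # hypothetical attribute

    dataset_path = dset.name        # full path of the dataset within the file
    first_three = dset[0:3]         # slicing reads only the requested elements
    units = dset.attrs["units"]
```

Slicing the dataset, as in `dset[0:3]`, reads only the selected elements, which is the partial I/O mentioned earlier.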

Using these basic abstractions, users can build specific “application formats” that

organize data in a way appropriate for the problem domain. For example, our “weather

station” code used one group for each station, and separate datasets for each measured

parameter, with attributes to hold additional information about what the datasets mean.

It's very common for laboratories or other organizations to agree on such a

“format-within-a-format” that specifies what arrangement of groups, datasets, and attributes

is to be used to store information.
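
As a rough sketch of what such an application format might look like in code, the helper below follows the layout described above: one group per station, one dataset per measured parameter, and attributes for the extra information. The function name write_station, the attribute names "dt" and "station", and the sample values are assumptions made for illustration, not part of any agreed format.

```python
import numpy as np
import h5py

def write_station(f, station_id, temperature, wind, dt):
    """Store one station as a group with one dataset per measured parameter.

    Hypothetical helper: the layout mirrors the weather station description,
    but the attribute names are illustrative assumptions.
    """
    grp = f.require_group(str(station_id))   # one group per station
    for name, values in (("temperature", temperature), ("wind", wind)):
        dset = grp.create_dataset(name, data=np.asarray(values, dtype="f8"))
        dset.attrs["dt"] = dt                # assumed: sampling interval in seconds
    grp.attrs["station"] = station_id        # assumed: station identifier
    return grp

# In-memory file so the sketch leaves nothing on disk.
with h5py.File("weather_demo.hdf5", "w", driver="core", backing_store=False) as f:
    write_station(f, 15, [20.1, 20.3, 20.6], [3.0, 4.5, 2.2], dt=10.0)
    station_names = sorted(f.keys())
    parameter_names = sorted(f["/15"].keys())
    dt_read = f["/15/temperature"].attrs["dt"]
```

Putting the convention in a single function is one way to keep every writer of the format consistent; readers only need to know the same group, dataset, and attribute names.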

Since HDF5 takes care of all cross-platform issues like endianness, sharing data with

other groups becomes a simple matter of manipulating groups, datasets, and attributes

to get the desired result. And because the files are self-describing, even knowing about

the application format isn't usually necessary to get data out of the file. You can simply

open the file and explore its contents:

>>> f.keys()
[u'15', u'big', u'comp']
>>> f["/15"].keys()
[u'temperature', u'wind']
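
The same exploration can be automated for a whole file. The sketch below uses h5py's visititems method to walk every group and dataset and collect each object's path, kind, and attribute names. The helper name describe, the demo file contents, and the "units" attribute are hypothetical; the file is built in memory only so that the example is self-contained. Note that under Python 3, keys() returns a view rather than a list, so wrap it in list() or sorted() if a list is needed.

```python
import numpy as np
import h5py

def describe(f):
    """Return (path, kind, attribute names) for every object below the root."""
    found = []

    def visit(name, obj):
        # visititems passes each object's path (relative to f) and the object.
        kind = "dataset" if isinstance(obj, h5py.Dataset) else "group"
        found.append((name, kind, sorted(obj.attrs.keys())))
        # Returning None tells h5py to keep walking.

    f.visititems(visit)
    return found

# Hypothetical in-memory file standing in for one received from someone else.
with h5py.File("explore_demo.hdf5", "w", driver="core", backing_store=False) as f:
    # Intermediate groups such as "15" are created automatically.
    f.create_dataset("15/temperature", data=np.zeros(4))
    f["15/temperature"].attrs["units"] = "degC"   # hypothetical attribute
    f.create_dataset("big", shape=(100,), dtype="f4")
    contents = sorted(describe(f))
```

Because names, shapes, types, and attributes are all stored in the file itself, a walk like this needs no prior knowledge of the application format.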
