Groups, Links, and Iteration: The “H” in HDF5 - Python and HDF5


of items, each of which is orderable according to some scheme like a string name or

numeric identifier, and building up a tree-like “index” to rapidly retrieve an item.

For example, if you have an HDF5 group with a single member, and another group with

a million members, it doesn't take a million times as long to open an object in the latter

group. Group members are indexed by name, so if you know the name of an object then

HDF5 can traverse the index and quickly retrieve the item. The same is true when

creating a new group member; HDF5 doesn't have to “insert” the member into the

middle of a big table somewhere, shuffling all the entries around.

Of course, all of this is transparent to the user. Every group in an HDF5 file comes with

an index that tracks members in alphabetical order. Keep in mind this means “C-style”

alphabetical order (whimsically called “ASCIIbetical” order):

>>> import h5py
>>> f = h5py.File('iterationdemo.hdf5', 'w')
>>> f.create_group('1')
>>> f.create_group('2')
>>> f.create_group('10')
>>> f.create_dataset('data', (100,))
>>> f.keys()
[u'1', u'10', u'2', u'data']
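The ordering above is the same one Python's built-in string comparison gives for ASCII names, so it can be reproduced without HDF5 at all. The sketch below contrasts it with a numeric-aware sort; `natural_key` is a hypothetical helper written for this illustration, not part of h5py.

```python
import re

names = ['1', '2', '10', 'data']

# Plain string comparison works character by character, so '10' sorts
# before '2' because the character '1' is less than the character '2'.
ascii_order = sorted(names)

def natural_key(name):
    """Split a name into digit and non-digit runs so numbers compare as integers.

    Hypothetical helper for illustration; not part of h5py or HDF5.
    """
    return [(0, int(part), '') if part.isdecimal() else (1, 0, part)
            for part in re.findall(r'\d+|\D+', name)]

numeric_order = sorted(names, key=natural_key)

print(ascii_order)    # ['1', '10', '2', 'data']
print(numeric_order)  # ['1', '2', '10', 'data']
```

If members need to appear in numeric order, zero-padding the names when creating them ('01', '02', '10') is a common alternative to sorting with a custom key afterward.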

Files can also contain other optional indices, for example those that track object creation

time, but h5py doesn't expose them.

This brings us to the first point: h5py will generally iterate over objects in the file in

alphabetical order (especially for small groups), but you shouldn't rely on this behavior.

Behind the scenes, HDF5 is actually retrieving objects in so-called native order, which

basically means “as fast as possible.” The only thing that's guaranteed is that if you don't

modify the group, the order will remain the same.
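Since native order is not guaranteed to be alphabetical, code that depends on a particular order can sort the names itself. The following is a minimal sketch: `iter_sorted` is a hypothetical helper, and a plain dict stands in for an open group, on the assumption that the group supports iteration over names and lookup by name as described in this chapter.

```python
def iter_sorted(group, key=None):
    """Yield (name, member) pairs in explicitly sorted name order.

    Hypothetical helper; works on any mapping whose iteration yields names.
    """
    for name in sorted(group, key=key):
        yield name, group[name]

# A dict stands in for an open h5py group in this illustration.
stand_in = {'data': 'dataset', '2': 'group', '10': 'group', '1': 'group'}

# Default: plain string order, independent of the order the mapping returns.
ordered_names = [name for name, _ in iter_sorted(stand_in)]

# Custom key: shorter names first, then string order within each length.
by_length = [name for name, _ in iter_sorted(stand_in, key=lambda n: (len(n), n))]

print(ordered_names)  # ['1', '10', '2', 'data']
print(by_length)      # ['1', '2', '10', 'data']
```

Sorting requires reading every name first, so for very large groups this gives up some of the speed that native order provides.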

Dictionary-Style Iteration

In keeping with the general convention that groups work like dictionaries, iterating over

a group in HDF5 provides the names of the members. Remember, these will be supplied

as Unicode strings:

>>> [x for x in f]
[u'1', u'10', u'2', u'data']

There are also iterkeys (equivalent to the preceding), itervalues, and iteritems

methods, which do just what you'd expect:

>>> [y for y in f.itervalues()]
[<HDF5 group "/1" (0 members)>,
 <HDF5 group "/10" (0 members)>,
 <HDF5 group "/2" (0 members)>,
 <HDF5 dataset "data": shape (100,), type "<f4">]
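The iterkeys, itervalues, and iteritems methods shown here belong to Python 2. Under Python 3, h5py groups provide keys(), values(), and items() instead, which return view objects, and names come back as ordinary str without the u prefix. The sketch below assumes a Python 3 installation of h5py and uses an in-memory file (the 'core' driver with backing_store=False) so that nothing is written to disk; the variable names are chosen for illustration only.

```python
import h5py

# In-memory file: the 'core' driver with backing_store=False avoids
# creating iterationdemo.hdf5 on disk.
with h5py.File('iterationdemo.hdf5', 'w', driver='core', backing_store=False) as f:
    f.create_group('1')
    f.create_group('2')
    f.create_group('10')
    f.create_dataset('data', (100,))

    # keys() replaces iterkeys(); wrap the view in list() to get a list.
    names = list(f.keys())

    # items() replaces iteritems(); it yields (name, object) pairs.
    kinds = {name: type(obj).__name__ for name, obj in f.items()}

    # values() replaces itervalues(); it yields the member objects.
    n_values = len(list(f.values()))

print(sorted(names))
print(kinds)
print(n_values)
```

The names are sorted before printing because, as noted earlier, the iteration order itself should not be relied on.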
