Groups, Links, and Iteration: The “H” in HDF5 - Python and HDF5


>>> data = do_large_calculation()
>>> with h5py.File('output.hdf5') as f:
...     f.create_dataset('results', data=data)

If there are many datasets and groups in the file, it might not be appropriate to overwrite the entire file every time the code runs. But if we don't open in w mode, then our program will only work the first time, unless we manually remove the output file every time it runs.

To deal with this, create_group and create_dataset have companion methods called require_group and require_dataset. They do exactly the same thing, except that they first check for an existing group or dataset and, if one is found, return it instead of creating a new one.
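As a rough illustration of the paragraph above, the sketch below calls the same save routine twice on one file, standing in for two runs of the same program. The function name save_results, the group name 'analysis', and the file name 'rerun_demo.hdf5' are hypothetical, and the in-memory "core" driver is used only so that the demo leaves nothing on disk.

```python
import numpy as np
import h5py


def save_results(f, data):
    """Store `data` under /analysis/results, reusing existing objects if present.

    `save_results` and the names 'analysis' and 'results' are hypothetical,
    chosen only for this illustration.
    """
    grp = f.require_group('analysis')  # created on the first call, returned afterwards
    dset = grp.require_dataset('results', shape=data.shape, dtype=data.dtype)
    dset[...] = data  # overwrite the stored values in place
    return dset


# An in-memory file (core driver, no backing store), so nothing is written to disk.
with h5py.File('rerun_demo.hdf5', 'w', driver='core', backing_store=False) as f:
    save_results(f, np.arange(5.0))      # "first run": creates the group and dataset
    save_results(f, np.arange(5.0) * 2)  # "second run": reuses both
    member_names = list(f['analysis'])   # names of the members of the group
    values = f['analysis/results'][...]  # read the data back as a NumPy array
```

After both calls the group still holds a single dataset, now containing the values from the second call. Replacing the two require_ calls with create_group and create_dataset would make the second call fail, because the names already exist.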

Both versions take exactly the same arguments and keywords. In the case of require_dataset, h5py also checks the requested shape and dtype against any existing dataset and fails if they don't match:

>>> f.create_dataset('dataset', (100,), dtype='i')
>>> f.require_dataset('dataset', (100,), dtype='f')
TypeError: Datatypes cannot be safely cast (existing int32 vs new f)

There's a minor detail here: a conflict is only deemed to occur if the shapes don't match, or if the requested precision of the datatype is higher than the existing precision. So if there's a preexisting int64 dataset, then require_dataset will succeed if int32 is requested:

>>> f.create_dataset('int_dataset', (100,), dtype='int64')
>>> f.require_dataset('int_dataset', (100,), dtype='int32')

The NumPy casting rules are used to check for conflicts; you can test the types yourself using np.can_cast.
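A minimal sketch of such a check follows. The helper name dtype_is_compatible is hypothetical, and the direction of the test (requested type cast to existing type) is an assumption chosen to agree with the two require_dataset examples above, not a statement of how h5py implements it.

```python
import numpy as np


def dtype_is_compatible(requested, existing):
    """Return True if `requested` can be safely cast to `existing`.

    Hypothetical helper: it assumes the check runs from the requested dtype
    to the dtype already on disk, which matches the examples in the text.
    """
    return bool(np.can_cast(np.dtype(requested), np.dtype(existing)))


narrower_int = dtype_is_compatible('int32', 'int64')  # int32 requested, int64 exists
float_vs_int = dtype_is_compatible('f', 'i')          # float32 requested, int32 exists
wider_int = dtype_is_compatible('int64', 'int32')     # int64 requested, int32 exists
```

Under NumPy's default "safe" casting rule, only the first of the three checks reports a compatible pair; the other two correspond to requests that would conflict with the existing dataset.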

Iteration and Containership

Iteration is a core Python concept, key to writing “Pythonic” code that runs quickly and that your colleagues can understand. It's also a natural way to explore the contents of groups.

How Groups Are Actually Stored

In the HDF5 file, group members are indexed using a structure called a “B-tree.” This isn't a computer science text, so we won't spend too long on the subject, but it's valuable to have a rough understanding of what's going on behind the scenes, especially if you're dealing with groups that have thousands or hundreds of thousands of items.

“B-trees” are data structures that are great for keeping track of large numbers of items, while still making retrieval (and addition) of items fast. They work by taking a collection
