Working with Datasets - Python and HDF5

CHAPTER 3

Working with Datasets

Datasets are the central feature of HDF5. You can think of them as NumPy arrays that live on disk. Every dataset in HDF5 has a name, a type, and a shape, and supports random access. Unlike the built-in np.save and friends, there's no need to read and write the entire array as a block; you can use the standard NumPy syntax for slicing to read and write just the parts you want.
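As a minimal sketch of that partial access, the following writes one block of a dataset and reads back only a small piece of it. The file name, the dataset name "big", and the in-memory "core" driver are choices made here for illustration and are not taken from the text; an ordinary on-disk file should behave the same way.

```python
import numpy as np
import h5py

# driver="core" with backing_store=False keeps the file in memory, so this
# sketch leaves nothing behind on disk (an assumption made for convenience).
with h5py.File("slicing_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    # Hypothetical dataset: 100 x 100 float64 values, zero-filled by default.
    dset = f.create_dataset("big", shape=(100, 100), dtype="f8")

    # Write a single 10 x 10 block; the rest of the dataset is not touched.
    dset[20:30, 40:50] = np.ones((10, 10))

    # Read back only a 2 x 3 piece of that block as a NumPy array.
    corner = dset[20:22, 40:43]

    # An element outside the written block still holds the fill value.
    untouched = dset[0, 0]

print(corner.shape)
```

Only the requested region is transferred in each slicing operation, which is the practical difference from saving and loading a whole array at once.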

Dataset Basics

First, let's create a file so we have somewhere to store our datasets:

>>> f = h5py.File("testfile.hdf5", "w")

Every dataset in an HDF5 file has a name. Let's see what happens if we just assign a new NumPy array to a name in the file:

>>> arr = np.ones((5, 2))
>>> f["my dataset"] = arr
>>> dset = f["my dataset"]

>>> dset
<HDF5 dataset "my dataset": shape (5, 2), type "<f8">

We put in a NumPy array but got back something else: an instance of the class h5py.Dataset. This is a “proxy” object that lets you read and write to the underlying HDF5 dataset on disk.
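The proxy behavior can be sketched as follows. This is an illustration under stated assumptions rather than the book's own listing: the file name and the in-memory "core" driver are chosen here so the example leaves no file behind, and the variable names are hypothetical.

```python
import numpy as np
import h5py

# In-memory file (an assumption for this sketch); a regular file works the same.
with h5py.File("proxy_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    f["my dataset"] = np.ones((5, 2))
    dset = f["my dataset"]

    # The object we get back is the proxy, not a NumPy array.
    is_proxy = isinstance(dset, h5py.Dataset)
    is_array = isinstance(dset, np.ndarray)

    # Slicing the proxy reads from the file and returns a real NumPy array.
    out = dset[...]

    # Assigning through the proxy writes to the dataset in the file.
    dset[0, :] = 5.0
    first_row = dset[0, :]

print(is_proxy, is_array, type(out).__name__)
```

Note that `out` is an independent in-memory copy taken before the write, so it keeps the original values while `first_row` reflects what is now stored in the file.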

Type and Shape

Let's explore the Dataset object. If you're using IPython, type dset. and hit Tab to see the object's attributes; otherwise, do dir(dset). There are a lot, but a few stand out:

>>> dset.dtype

dtype('float64')
