array([[0, 0],
       [0, 0]])
For some applications, it's nice to pick a default value other than 0. You might want to
set unmodified elements to -1, or even NaN for floating-point datasets.
HDF5 addresses this with a fill value, which is the value returned for the areas of a dataset
that haven't been written to. Fill values are handled when data is read, so they don't cost
you anything in terms of storage space. They're defined when the dataset is created, and
can't be changed:
>>> dset = f.create_dataset('filled', (2, 2), dtype=np.int32, fillvalue=42)
>>> dset[...]
array([[42, 42],
       [42, 42]])
A dataset's fill value is available on the fillvalue property:
>>> dset.fillvalue
42
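Floating-point datasets can likewise use NaN as a fill value, as mentioned above. Here's a minimal sketch; the dataset name is just for illustration:
>>> dset_nan = f.create_dataset('filled_nan', (2, 2), dtype=np.float64,
...                             fillvalue=np.nan)  # name is illustrative
>>> dset_nan[...]
array([[nan, nan],
       [nan, nan]])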
Reading and Writing Data
Your main day-to-day interaction with Dataset objects will look a lot like your
interactions with NumPy arrays. One of the design goals for the h5py package was to “recycle”
as many NumPy metaphors as possible for datasets, so that you can interact with them
in a familiar way.
Even if you're an experienced NumPy user, don't skip this section! There are important
performance differences and implementation subtleties between the two that may trip
you up.
Before we dive into the nuts and bolts of reading from and writing to datasets, it's
important to spend a few minutes discussing how Dataset objects aren't like NumPy
arrays, especially from a performance perspective.
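To make one such difference concrete, here's a minimal sketch (the file and dataset names are hypothetical): every element access on a Dataset is a separate trip through the HDF5 library, so looping over elements one at a time is dramatically slower than reading a slice into a NumPy array first and working in memory:
>>> import numpy as np
>>> import h5py
>>> f3 = h5py.File('perf_demo.hdf5', 'w')  # hypothetical file
>>> dset2 = f3.create_dataset('data', (1000,), dtype=np.float32)
>>> total = 0.0
>>> for i in range(1000):  # slow: each index is a separate HDF5 read
...     total += dset2[i]
...
>>> arr = dset2[...]       # fast: one read into an in-memory NumPy array
>>> total = arr.sum()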
Using Slicing Effectively
In order to use Dataset objects efficiently, we have to know a little about what goes on
behind the scenes. Let's take the example of reading from an existing dataset. Suppose
we have the (100, 1000)-shape array from the previous example:
>>> dset = f2['big']
>>> dset
<HDF5 dataset "big": shape (100, 1000), type "<f4">
Now we request a slice:
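For example (a sketch; the slice ranges here are illustrative):
>>> out = dset[0:10, 20:70]  # illustrative ranges
>>> out.shape
(10, 50)
The result is a regular NumPy array holding a copy of the selected region; the actual read from disk happens at the moment the slice is requested.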