>>> out = dset[0:10, 20:70]
>>> out.shape
(10, 50)
Here's what happens behind the scenes when we do the slicing operation:
1. h5py figures out the shape (10, 50) of the resulting array object.
2. An empty NumPy array is allocated of shape (10, 50).
3. HDF5 selects the appropriate part of the dataset.
4. HDF5 copies data from the dataset into the empty NumPy array.
5. The newly filled in NumPy array is returned.
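To make these steps concrete, here is a rough sketch of doing them by hand with NumPy and h5py's read_direct method, which reads straight into a preallocated array. The file name "weather.hdf5" and dataset name "data" are just placeholders for whatever dataset you happen to be working with.
import numpy as np
import h5py

# Placeholder file and dataset names; assume "data" is our (100, 1000) dataset
with h5py.File("weather.hdf5", "r") as f:
    dset = f["data"]

    # Steps 1-2: work out the output shape and allocate an empty NumPy array
    out = np.empty((10, 50), dtype=dset.dtype)

    # Steps 3-4: HDF5 selects the region and copies it into our array
    dset.read_direct(out, source_sel=np.s_[0:10, 20:70])

    # Step 5: "out" now holds the same data that dset[0:10, 20:70] would return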
You'll notice that this implies a certain amount of overhead. Not only do we create a
new NumPy array for each slice requested, but we have to figure out what size the array
object should be, check that the selection falls within the bounds of the dataset, and have
HDF5 perform the selection, all before we've read a single byte of data.
This leads us to the first and most important performance tip when using datasets: take reasonably sized slices.
Here's an example: using our (100, 1000)-shape dataset, which of the following do you
think is likely to be faster?
# Check for negative values and clip to 0
for ix in xrange(100):
    for iy in xrange(1000):
        val = dset[ix, iy]  # Read one element
        if val < 0: dset[ix, iy] = 0  # Clip to 0 if needed
or
# Check for negative values and clip to 0
for ix in xrange(100):
    val = dset[ix, :]   # Read one row
    val[val < 0] = 0    # Clip negative values to 0
    dset[ix, :] = val   # Write row back out
In the first case, we perform 100,000 slicing operations (one read per element, plus a write for each negative value). In the second, we perform only 100 reads and 100 writes.
This may seem like a trivial example, but the first pattern creeps into real-world code frequently; with fast in-memory slicing on NumPy arrays, it is actually reasonably quick on modern machines. But once every access has to go through the whole slice-allocate-HDF5-read pipeline outlined here, things start to bog down.
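If you want to see the difference on your own machine, a quick-and-dirty timing comparison along these lines makes the point (Python 3 here, with a throwaway file name); the exact numbers will vary with your hardware and HDF5 build.
import time
import numpy as np
import h5py

# Throwaway file with a (100, 1000) dataset of random values
with h5py.File("clip_timing.hdf5", "w") as f:
    dset = f.create_dataset("data", data=np.random.randn(100, 1000))

    start = time.time()
    for ix in range(100):          # element-at-a-time clipping
        for iy in range(1000):
            if dset[ix, iy] < 0:
                dset[ix, iy] = 0
    print("Per-element:", time.time() - start, "seconds")

    start = time.time()
    for ix in range(100):          # row-at-a-time clipping
        val = dset[ix, :]
        val[val < 0] = 0
        dset[ix, :] = val
    print("Per-row:", time.time() - start, "seconds")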
The same applies to writing, although fewer steps are involved. When you perform a
write operation, for example:
>>> some_dset[0:10, 20:70] = out*2
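On the write side, if your data is already sitting in a C-contiguous NumPy array of the right shape and type, h5py also offers a write_direct method, which is intended to avoid extra intermediate handling on the way in. A minimal sketch, again with placeholder file and dataset names:
import numpy as np
import h5py

# Placeholder names; assume "data" is the (100, 1000) dataset from before
with h5py.File("weather.hdf5", "r+") as f:
    some_dset = f["data"]

    # Data we want to store, shaped to match the (10, 50) selection
    block = np.random.rand(10, 50)
    some_dset.write_direct(block, dest_sel=np.s_[0:10, 20:70])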