Working with Datasets - Python and HDF5


array([ 0.98885498,  0.        ,  0.        ,  0.        ,  0.66211931,
        0.45692186,  0.07123649,  0.        ,  0.22059144,  0.        ])

On the HDF5 side, this is handled by transforming the Boolean array into a list of coordinates in the dataset. There are a couple of consequences as a result.

First, for very large indexing expressions with lots of True values, it may be faster to modify the data on the Python side and write the whole dataset out again. If you suspect a slowdown, it's a good idea to test this.
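
One way to run such a test is sketched below. It times a Boolean-mask assignment against a read-modify-write of the whole array. The array size, the dataset name, and the use of an in-memory file (the `core` driver with `backing_store=False`) are assumptions made to keep the sketch self-contained; timings against a real file on disk will differ, so treat the numbers as a starting point only.

```python
import time

import h5py
import numpy as np

rng = np.random.default_rng(0)
data = rng.random(200_000) - 0.5   # roughly half of the values are negative

# In-memory file: nothing is written to disk (an assumption for this sketch).
with h5py.File("mask_timing.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("data", data=data)

    # Approach 1: Boolean-mask assignment directly on the dataset.
    start = time.perf_counter()
    dset[data < 0] = 0
    t_mask = time.perf_counter() - start
    via_mask = dset[...]

    # Restore the original contents before timing the second approach.
    dset[...] = data

    # Approach 2: modify a copy in memory, then write the whole array back.
    start = time.perf_counter()
    clipped = data.copy()
    clipped[clipped < 0] = 0
    dset[...] = clipped
    t_full = time.perf_counter() - start
    via_full = dset[...]

print(f"mask assignment: {t_mask:.4f} s")
print(f"full rewrite:    {t_full:.4f} s")
```

Both approaches leave the same values in the dataset; only the cost differs, and which one wins depends on the size of the dataset and the fraction of True values in the mask.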

Second, the expression on the right-hand side has to be either a scalar, or an array with exactly the right number of points. This isn't quite as burdensome a requirement as it might seem. If the number of elements that meet the criteria is small, it's actually a very effective way to “update” the dataset.

For example, what if instead of clipping the negative values to zero, we wanted to flip them and make them positive? We could modify the original array and write the entire thing back out to disk. Or, we could modify just the elements we want:

>>> dset[data<0] = -1*data[data<0]
>>> dset[...]
array([ 0.98885498,  0.28554781,  0.17157685,  0.05227003,  0.66211931,
        0.45692186,  0.07123649,  0.40374417,  0.22059144,  0.82367672])

Note that the number of elements (five, in this case) is the same on the left- and right-hand sides of the preceding assignment.
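
The matching counts can be checked with plain NumPy before touching the file. The sketch below uses a small made-up array (not the data from the example above) to show the coordinates a Boolean mask stands for, and that the right-hand side has exactly one value per True entry. It illustrates the idea only and is not h5py's internal code.

```python
import numpy as np

# Made-up sample values for illustration.
data = np.array([0.5, -0.2, 0.1, -0.7, -0.3])
mask = data < 0

# The positions where the mask is True: the "list of coordinates".
coords = np.flatnonzero(mask)

# One replacement value per selected element.
rhs = -1 * data[mask]

print(coords)                          # positions 1, 3 and 4
print(int(mask.sum()), rhs.size)       # both counts are 3

# Applying the same assignment to an in-memory copy flips the negatives.
flipped = data.copy()
flipped[mask] = rhs
```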

Coordinate Lists

There's another feature borrowed from NumPy, with a few modifications. When slicing into a dataset, for any axis, instead of an x:y:z-style slicing expression you can supply a list of indices. Let's use our 10-element range dataset again:

>>> dset = f['range']
>>> dset[...]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

Suppose we wanted just elements 1, 2, and 7. We could manually extract them one at a time as dset[1], dset[2], and dset[7]. We could also use a Boolean indexing array with its values set to True at locations 1, 2, and 7.

Or, we could simply specify the elements desired using a list:

>>> dset[[1,2,7]]
array([1, 2, 7])
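
The same idea carries over to datasets with more than one axis. The sketch below assumes a small two-dimensional dataset named `grid` in an in-memory file (both are made up for illustration) and supplies a list for one axis at a time. As far as we know, h5py expects the indices in such a list to be in increasing order, so the lists here are sorted.

```python
import h5py
import numpy as np

# In-memory file and dataset name are assumptions for this sketch.
with h5py.File("coords_demo.h5", "w", driver="core", backing_store=False) as f:
    grid = f.create_dataset("grid", data=np.arange(12).reshape(4, 3))

    rows = grid[[0, 2], :]   # rows 0 and 2, every column
    cols = grid[:, [0, 2]]   # every row, columns 0 and 2

print(rows)
print(cols)
```

Each result is an ordinary NumPy array whose shape follows the list: two rows by three columns in the first case, four rows by two columns in the second.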

This may seem trivial, but it's implemented in a way that is much more efficient than Boolean masking for large datasets. Instead of generating a laundry list of coordinates to access, h5py breaks the selection down into contiguous “subselections,” which are much faster when multiple axes are involved.
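
To picture what a contiguous subselection is, here is a small pure-Python sketch that groups an increasing list of indices into runs of neighbors. The helper `contiguous_runs` is hypothetical and written only for illustration; it is not part of h5py and does not claim to mirror how h5py builds its selections internally.

```python
def contiguous_runs(indices):
    """Group increasing integer indices into half-open (start, stop) runs."""
    runs = []
    for i in indices:
        if runs and runs[-1][1] == i:
            # i directly follows the previous run, so extend that run.
            runs[-1] = (runs[-1][0], i + 1)
        else:
            # Otherwise start a new run of length one.
            runs.append((i, i + 1))
    return runs


print(contiguous_runs([1, 2, 7]))   # [(1, 3), (7, 8)]
```

For the list [1, 2, 7], elements 1 and 2 collapse into a single block and element 7 forms a block of its own, so two selections cover what would otherwise be three separate coordinates.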
