array([ 0.98885498, 0. , 0. , 0. , 0.66211931,
0.45692186, 0.07123649, 0. , 0.22059144, 0. ])
On the HDF5 side, this is handled by transforming the Boolean array into a list of
coordinates in the dataset. This has a couple of consequences.
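To make the "list of coordinates" idea concrete, here is a small NumPy-only analogy. It is not h5py's internal code, just an illustration of how a Boolean mask corresponds to one coordinate per True element; the array values are made up for the example.

```python
import numpy as np

# Made-up sample data; the mask marks the negative entries.
data = np.array([0.9, -0.3, -0.2, 0.6, -0.4])
mask = data < 0

# np.argwhere returns one row of coordinates per True element.
coords = np.argwhere(mask)

# The same idea in two dimensions: each row is an (i, j) pair.
mask_2d = np.array([[True, False],
                    [False, True]])
coords_2d = np.argwhere(mask_2d)

print(coords.tolist())     # [[1], [2], [4]]
print(coords_2d.tolist())  # [[0, 0], [1, 1]]
```

The longer the list of True values, the longer the coordinate list that has to be built and handed to HDF5.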
First, for very large indexing expressions with lots of True values, it may be faster to
modify the data on the Python side and write the entire dataset out again. If you
suspect a slowdown, it's a good idea to test this.
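One way to run such a test is sketched below. The file name, dataset name, and array size are placeholders chosen for illustration, and the file uses h5py's in-memory "core" driver so nothing is written to disk; real timings depend on dataset size, chunking, and storage, so treat the numbers as a rough guide only.

```python
import time
import numpy as np
import h5py

# In-memory file (core driver, no backing store); the names are placeholders.
with h5py.File("timing_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    rng = np.random.default_rng(0)
    data = rng.random(20_000) - 0.5        # roughly half the values are negative
    dset = f.create_dataset("values", data=data)
    mask = data < 0

    # Option A: Boolean-mask write directly into the dataset.
    start = time.perf_counter()
    dset[mask] = 0
    t_mask = time.perf_counter() - start

    # Restore the original contents before timing the alternative.
    dset[...] = data

    # Option B: modify a copy in memory, then write the whole dataset back.
    start = time.perf_counter()
    clipped = data.copy()
    clipped[mask] = 0
    dset[...] = clipped
    t_full = time.perf_counter() - start

    result = dset[...]

print(f"mask write: {t_mask:.4f} s, full rewrite: {t_full:.4f} s")
```

Both options leave the dataset with the same contents, so the comparison is purely about speed.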
Second, the expression on the right-hand side has to be either a scalar, or an array with
exactly the right number of points. This isn't quite as burdensome a requirement as it
might seem. If the number of elements that meet the criteria is small, it's actually a very
effective way to “update” the dataset.
For example, what if instead of clipping the negative values to zero, we wanted to flip
them and make them positive? We could modify the original array and write the entire
thing back out to disk. Or, we could modify just the elements we want:
>>> dset[data<0] = -1*data[data<0]
>>> dset[...]
array([ 0.98885498, 0.28554781, 0.17157685, 0.05227003, 0.66211931,
0.45692186, 0.07123649, 0.40374417, 0.22059144, 0.82367672])
Note that the number of elements (five, in this case) is the same on the left- and right-hand
sides of the preceding assignment.
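The two allowed forms of the right-hand side can be compared side by side. This is a minimal sketch with made-up values and placeholder file and dataset names, again using an in-memory file; it checks the element count explicitly before the array assignment.

```python
import numpy as np
import h5py

# In-memory file; the file and dataset names are placeholders.
with h5py.File("mask_update_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    data = np.array([0.5, -0.25, -1.0, 2.0, -3.0])
    dset = f.create_dataset("values", data=data)

    mask = data < 0
    n_selected = int(np.count_nonzero(mask))   # number of elements the mask selects

    # Scalar right-hand side: the value is applied to every selected element.
    dset[mask] = 0
    after_scalar = dset[...]

    # Array right-hand side: it must hold exactly n_selected values.
    replacement = -data[mask]
    assert replacement.shape == (n_selected,)
    dset[mask] = replacement
    after_array = dset[...]

print(after_scalar.tolist())  # [0.5, 0.0, 0.0, 2.0, 0.0]
print(after_array.tolist())   # [0.5, 0.25, 1.0, 2.0, 3.0]
```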
Coordinate Lists
There's another feature borrowed from NumPy, with a few modifications. When slicing
into a dataset, for any axis, instead of an x:y:z-style slicing expression you can supply a
list of indices. Let's use our 10-element range dataset again:
>>> dset = f['range']
>>> dset[...]
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
Suppose we wanted just elements 1, 2, and 7. We could manually extract them one at a
time as dset[1], dset[2], and dset[7]. We could also use a Boolean indexing array
with its values set to True at locations 1, 2, and 7.
Or, we could simply specify the elements desired using a list:
>>> dset[[1,2,7]]
array([1, 2, 7])
This may seem trivial, but it's implemented in a way that is much more efficient than
Boolean masking for large datasets. Instead of generating a laundry list of coordinates
to access, h5py breaks the selection down into contiguous “subselections,” which are
much faster when multiple axes are involved.
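The sketch below shows a coordinate list on a one-dimensional dataset and on one axis of a two-dimensional dataset. The "grid" dataset and the file name are hypothetical additions for illustration, and the file lives in memory. The indices are given in increasing order, which h5py's documentation requires for list selections.

```python
import numpy as np
import h5py

# In-memory file; "grid" and the file name are placeholders for this example.
with h5py.File("coord_list_sketch.hdf5", "w", driver="core", backing_store=False) as f:
    # The 10-element range dataset from the text.
    dset = f.create_dataset("range", data=np.arange(10))
    picked = dset[[1, 2, 7]]

    # A 10x4 dataset: pick rows 1, 2, and 7, keeping every column.
    grid = f.create_dataset("grid", data=np.arange(40).reshape(10, 4))
    rows = grid[[1, 2, 7], :]

print(picked.tolist())  # [1, 2, 7]
print(rows.shape)       # (3, 4)
```

Here rows 1 and 2 are adjacent, so they can be read as one contiguous block, with row 7 as a second block.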