>>> out = dset[0:10, 20:70]
>>> out.shape
(10, 50)
Here's what happens behind the scenes when we do the slicing operation:
1. h5py figures out the shape (10, 50) of the resulting array object.
2. An empty NumPy array is allocated of shape (10, 50).
3. HDF5 selects the appropriate part of the dataset.
4. HDF5 copies data from the dataset into the empty NumPy array.
5. The newly filled in NumPy array is returned.
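To make these steps concrete, here is a rough sketch of doing them by hand with NumPy and h5py's read_direct method, which reads straight into a preallocated array. The file name "weather.hdf5" and dataset name "data" are just placeholders for whatever dataset you happen to be working with.
import numpy as np
import h5py

# Placeholder file and dataset names; assume "data" is our (100, 1000) dataset
with h5py.File("weather.hdf5", "r") as f:
    dset = f["data"]

    # Steps 1-2: work out the output shape and allocate an empty NumPy array
    out = np.empty((10, 50), dtype=dset.dtype)

    # Steps 3-4: HDF5 selects the region and copies it into our array
    dset.read_direct(out, source_sel=np.s_[0:10, 20:70])

    # Step 5: "out" now holds the same data that dset[0:10, 20:70] would return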
You'll notice that this implies a certain amount of overhead. Not only do we create a
new NumPy array for each slice requested, but we have to figure out what size the array
object should be, check that the selection falls within the bounds of the dataset, and have
HDF5 perform the selection, all before we've read a single byte of data.
This leads us to the first and most important performance tip when using datasets: take reasonably sized slices.
Here's an example: using our (100, 1000)-shape dataset, which of the following do you
think is likely to be faster?
# Check for negative values and clip to 0
for ix in xrange(100):
    for iy in xrange(1000):
        val = dset[ix, iy]  # Read one element
        if val < 0: dset[ix, iy] = 0  # Clip to 0 if needed
or
# Check for negative values and clip to 0
for ix in xrange(100):
    val = dset[ix, :]   # Read one row
    val[val < 0] = 0    # Clip negative values to 0
    dset[ix, :] = val   # Write row back out
In the first case, we perform 100,000 slicing operations (one read per element, plus a write for each negative value). In the second, we perform only 100 reads and 100 writes.
This may seem like a trivial example, but the first pattern creeps into real-world code frequently; with fast in-memory slicing on NumPy arrays, it is actually reasonably quick on modern machines. But once every access has to go through the whole slice-allocate-HDF5-read pipeline outlined here, things start to bog down.
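If you want to see the difference on your own machine, a quick-and-dirty timing comparison along these lines makes the point (Python 3 here, with a throwaway file name); the exact numbers will vary with your hardware and HDF5 build.
import time
import numpy as np
import h5py

# Throwaway file with a (100, 1000) dataset of random values
with h5py.File("clip_timing.hdf5", "w") as f:
    dset = f.create_dataset("data", data=np.random.randn(100, 1000))

    start = time.time()
    for ix in range(100):          # element-at-a-time clipping
        for iy in range(1000):
            if dset[ix, iy] < 0:
                dset[ix, iy] = 0
    print("Per-element:", time.time() - start, "seconds")

    start = time.time()
    for ix in range(100):          # row-at-a-time clipping
        val = dset[ix, :]
        val[val < 0] = 0
        dset[ix, :] = val
    print("Per-row:", time.time() - start, "seconds")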
The same applies to writing, although fewer steps are involved. When you perform a
write operation, for example:
>>> some_dset[0:10, 20:70] = out*2
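On the write side, if your data is already sitting in a C-contiguous NumPy array of the right shape and type, h5py also offers a write_direct method, which is intended to avoid extra intermediate handling on the way in. A minimal sketch, again with placeholder file and dataset names:
import numpy as np
import h5py

# Placeholder names; assume "data" is the (100, 1000) dataset from before
with h5py.File("weather.hdf5", "r+") as f:
    some_dset = f["data"]

    # Data we want to store, shaped to match the (10, 50) selection
    block = np.random.rand(10, 50)
    some_dset.write_direct(block, dest_sel=np.s_[0:10, 20:70])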