the data across all 100 remaining indices. It's as efficient as you can get; there's only one
slicing operation, and the remainder of the time is spent writing data to disk.
Reading Directly into an Existing Array
Finally we come full circle back to read_direct, one of the most powerful methods
available on the Dataset object. It's as close as you can get to the “traditional” C interface
of HDF5, without getting into the internal details of h5py.
To recap, you can use read_direct to have HDF5 “fill in” data into an existing array,
automatically performing type conversion. Last time we saw how to read float32 data
into a float64 NumPy array:
>>> dset.dtype
dtype('float32')
>>> out = np.empty((100, 1000), dtype=np.float64)
>>> dset.read_direct(out)
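For readers who want to try the conversion without the file from earlier in the chapter, here is a minimal self-contained sketch. The file name, the dataset name "traces", and the use of an in-memory file (the core driver with no backing store) are assumptions made for illustration; they are not part of the original example.

```python
import numpy as np
import h5py

# Assumption: an in-memory HDF5 file, so nothing is written to disk.
with h5py.File("read_direct_demo.h5", "w", driver="core", backing_store=False) as f:
    # Hypothetical float32 dataset: 100 time traces of 1000 points each.
    data = np.arange(100 * 1000, dtype=np.float32).reshape(100, 1000)
    dset = f.create_dataset("traces", data=data)

    # Destination array with a different (wider) dtype.
    out = np.empty((100, 1000), dtype=np.float64)

    # HDF5 converts float32 to float64 while filling the array.
    dset.read_direct(out)

    # The array keeps its own dtype and holds the converted values.
    converted_ok = out.dtype == np.float64 and np.array_equal(out, data)
```

The values chosen here are all exactly representable in float32, so the comparison with the original data is exact after conversion.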
This works, but requires you to read the entire dataset in one go. Let's pick a more useful
example. Suppose we wanted to read the first time trace, at dset[0,:], and deposit it
into the out array at out[50,:]. We can use the source_sel and dest_sel keywords,
for source selection and destination selection respectively:
>>> dset.read_direct(out, source_sel=np.s_[0,:], dest_sel=np.s_[50,:])
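One way to convince yourself that only the selected row is touched is to start from a zero-filled array. The sketch below assumes a hypothetical in-memory file and dataset name; only the read_direct call itself comes from the text.

```python
import numpy as np
import h5py

# Assumption: in-memory file and a made-up dataset with no zero entries.
with h5py.File("selection_demo.h5", "w", driver="core", backing_store=False) as f:
    data = np.arange(1, 100 * 1000 + 1, dtype=np.float32).reshape(100, 1000)
    dset = f.create_dataset("traces", data=data)

    # Zero-filled destination, so untouched rows are easy to recognize.
    out = np.zeros((100, 1000), dtype=np.float64)

    # Read trace 0 from the file and deposit it in row 50 of the array.
    dset.read_direct(out, source_sel=np.s_[0, :], dest_sel=np.s_[50, :])

    row_matches = np.array_equal(out[50, :], data[0, :])
    # Every row except row 50 should still be all zeros.
    others_untouched = not np.delete(out, 50, axis=0).any()
```

Both selections describe 1000 elements, so HDF5 can pair them up one to one.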
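The odd-looking np.s_ is a gadget that takes slices, in the ordinary array-slicing syntax,
and returns the equivalent Python slice object (or, for multidimensional selections like
the ones above, a tuple of indices and slice objects) with the corresponding information.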
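A short NumPy-only sketch shows what np.s_ actually hands back. The variable names are illustrative only.

```python
import numpy as np

# A one-dimensional slice expression becomes a plain slice object.
one_d = np.s_[10:20]

# A multidimensional expression becomes a tuple of indices and slices.
first_row = np.s_[0, :]
first_cols = np.s_[:, 0:50]

# The result can be stored and used later exactly like the literal syntax.
arr = np.arange(12).reshape(3, 4)
same_result = np.array_equal(arr[first_row], arr[0, :])
```

Because the result is an ordinary Python object, a selection can be built once, passed around, and handed to any function that accepts an indexing argument.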
By the way, you don't have to match the shape of your output array to the dataset. Suppose
our application wanted to compute the mean of the first 50 data points in each time
trace, a common scenario when estimating DC offsets in real-world experimental data.
You could do this using the standard slicing techniques:
>>> out = dset[:,0:50]
>>> out.shape
(100, 50)
>>> means = out.mean(axis=1)
>>> means.shape
(100,)
Using read_direct this would look like:
>>> out = np.empty((100,50), dtype=np.float32)
>>> dset.read_direct(out, np.s_[:,0:50])  # dest_sel can be omitted
>>> means = out.mean(axis=1)
This may seem like a trivial case, but there's an important difference between the two
approaches. In the first example, the out array is created internally by h5py, used to
store the slice, and then thrown away. In the second example, out is allocated by the
user, and can be reused for future calls to read_direct.
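As a rough sketch of that reuse, the loop below allocates one buffer and refills it for several datasets. The in-memory file, the dataset names run0 through run2, and their constant contents are all assumptions made up for the illustration.

```python
import numpy as np
import h5py

# Assumption: in-memory file holding three same-shaped float32 datasets.
with h5py.File("reuse_demo.h5", "w", driver="core", backing_store=False) as f:
    for i in range(3):
        f.create_dataset(
            f"run{i}", data=np.full((100, 1000), float(i), dtype=np.float32)
        )

    # Allocate the scratch buffer once, outside the loop.
    buf = np.empty((100, 50), dtype=np.float32)

    offsets = {}
    for name in ("run0", "run1", "run2"):
        # Refill the same buffer with the first 50 points of every trace.
        f[name].read_direct(buf, np.s_[:, 0:50])
        # mean() returns a new array, so overwriting buf later is safe.
        offsets[name] = buf.mean(axis=1)
```

With the slicing syntax, each iteration would create a fresh (100, 50) array; here the only per-iteration allocation is the small array of means.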