>>> dset[...]
array([[1, 2],
       [3, 4]])
We'll try the same resizing as in the NumPy example:
>>> dset.resize((1, 4))
>>> dset[...]
array([[1, 2, 0, 0]])
>>> dset.resize((1, 10))
>>> dset[...]
array([[1, 2, 0, 0, 0, 0, 0, 0, 0, 0]])
What's going on here? When we changed the shape from (2, 2) to (1, 4), the data at
locations dset[1,0] and dset[1,1] didn't get reshuffled; it was lost. For this reason,
you should be very careful when using resize; the reshuffling tricks you've learned in
the NumPy world will quickly lead to trouble.
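For comparison, here is a minimal sketch of the same operation in NumPy (using np.resize, which returns a new array rather than resizing in place); unlike the HDF5 dataset, NumPy simply reflows the existing elements into the new shape:
>>> import numpy as np
>>> a = np.array([[1, 2], [3, 4]])
>>> np.resize(a, (1, 4))
array([[1, 2, 3, 4]])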
Finally, you'll notice that in this case the new elements are initialized to zero. In general,
they will be set to the dataset's fill value (see "Fill Values" on page 26).
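If you'd rather see something other than zeros, you can set the fill value when the dataset is created. A minimal sketch, assuming an open writable file f (the dataset name "filled" and the value 42 are purely illustrative):
>>> dset3 = f.create_dataset('filled', (2, 2), maxshape=(None, None),
...                          dtype='i8', fillvalue=42)
>>> dset3[...] = [[1, 2], [3, 4]]
>>> dset3.resize((2, 4))
>>> dset3[...]
array([[ 1,  2, 42, 42],
       [ 3,  4, 42, 42]])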
When and How to Use resize
One of the most common questions about HDF5 is how to “append” to a dataset. With
resize, this can be done if care is taken with respect to performance.
For example, let's say we have another dataset storing 1000-element time traces. However,
this time our application doesn't know how many to store. It could be 10, or 100,
or 1000. One approach might be this:
dset1 = f.create_dataset('timetraces', (1, 1000), maxshape=(None, 1000))

def add_trace_1(arr):
    dset1.resize((dset1.shape[0] + 1, 1000))
    dset1[-1, :] = arr
Here, every time a new 1000-element array is added, the dataset is simply expanded by
a single entry. But if the number of resize calls is equal to the number of insertions,
this doesn't scale well, particularly if traces will be added thousands of times.
An alternative approach might be to keep track of the number of insertions and then
“trim” the dataset when done:
dset2 = f.create_dataset('timetraces2', (5000, 1000), maxshape=(None, 1000))
ntraces = 0

def add_trace_2(arr):
    global ntraces
    dset2[ntraces, :] = arr
    ntraces += 1
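The trimming step itself isn't shown above; as a sketch (the function name done is just illustrative), it would shrink the dataset down to the rows actually written once all the traces are in:
def done():
    # Discard the unused rows; only one resize call is ever made
    dset2.resize((ntraces, 1000))
This way the dataset is resized exactly once no matter how many traces were added, at the cost of having to guess a reasonable upper bound (here 5000) up front.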