>>> data = do_large_calculation()
>>> with h5py.File('output.hdf5', 'a') as f:   # 'a': read/write, create if missing
...     f.create_dataset('results', data=data)
If there are many datasets and groups in the file, it might not be appropriate to overwrite
the entire file every time the code runs. But if we don't open in w mode, then our program
will only work the first time, unless we manually remove the output file every time it
runs.
To deal with this, create_group and create_dataset have companion methods called
require_group and require_dataset. They do exactly the same thing, except that they first
check for an existing group or dataset and, if one is found, return it instead of creating
a new one.
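As a minimal sketch of how these methods make a script safe to rerun, the function below writes results to the same file twice without deleting it in between. The helper save_results, the group name 'run', and the temporary file location are illustrative assumptions, not part of the original text.

```python
import os
import tempfile

import h5py
import numpy as np


def save_results(path, data):
    """Store `data` in run/results, creating the group and dataset only if needed."""
    # Mode 'a' opens the file read/write and creates it if it does not exist.
    with h5py.File(path, 'a') as f:
        grp = f.require_group('run')  # hypothetical group name
        # Returns the existing dataset on a rerun; creates it the first time.
        dset = grp.require_dataset('results', shape=data.shape, dtype=data.dtype)
        dset[...] = data
        return list(grp.keys())


path = os.path.join(tempfile.mkdtemp(), 'output.hdf5')
first = save_results(path, np.arange(10, dtype='int32'))
second = save_results(path, np.arange(10, dtype='int32') * 2)  # simulated rerun
```

The second call would raise an error if it used create_group and create_dataset, because both names already exist in the file.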
Both versions take exactly the same arguments and keywords. In the case of
require_dataset, h5py also checks the requested shape and dtype against any existing
dataset and fails if they don't match:
>>> f.create_dataset('dataset', (100,), dtype='i')
>>> f.require_dataset('dataset', (100,), dtype='f')
TypeError: Datatypes cannot be safely cast (existing int32 vs new f)
There's a minor detail here, in that a conflict is only deemed to occur if the shapes don't
match, or the requested precision of the datatype is higher than the existing precision.
So if there's a preexisting int64 dataset, then require_dataset will succeed if int32 is
requested:
>>> f.create_dataset('int_dataset', (100,), dtype='int64')
>>> f.require_dataset('int_dataset', (100,), dtype='int32')
The NumPy casting rules are used to check for conflicts; you can test the types yourself
using np.can_cast.
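The sketch below shows one way to run that check by hand. It assumes the test is whether the requested dtype can be safely cast to the existing one, which is consistent with the two examples above. The helper name compatible is hypothetical.

```python
import numpy as np


def compatible(requested, existing):
    """Return True if `requested` can be safely cast to `existing` (NumPy 'safe' rule)."""
    return bool(np.can_cast(np.dtype(requested), np.dtype(existing)))


# Requesting int32 when an int64 dataset exists: no conflict expected.
narrower_ok = compatible('int32', 'int64')

# Requesting 'f' (float32) when an int32 dataset exists: conflict expected.
float_vs_int = compatible('f', 'int32')

# Requesting int64 when an int32 dataset exists: higher precision, conflict expected.
wider_request = compatible('int64', 'int32')
```

Under this assumption, a False result corresponds to the case where require_dataset raises TypeError.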
Iteration and Containership
Iteration is a core Python concept, key to writing “Pythonic” code that runs quickly and
that your colleagues can understand. It's also a natural way to explore the contents of
groups.
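As a small illustration, iterating over a group yields the names of its members, and the in operator tests whether a name is present. The file name and member names below are made up for the example. The file is held in memory, using the core driver with no backing store, so nothing is written to disk.

```python
import h5py

# In-memory file: driver='core' with backing_store=False avoids touching the disk.
with h5py.File('iter_demo.hdf5', 'w', driver='core', backing_store=False) as f:
    f.create_group('subgroup')
    f.create_dataset('data1', (10,), dtype='i')
    f.create_dataset('data2', (10,), dtype='f')

    # Iterating over a group (here the root group) yields member names, not objects.
    names = [name for name in f]

    # Containership: test for a member by name.
    has_data1 = 'data1' in f
    has_missing = 'missing' in f
```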
How Groups Are Actually Stored
In the HDF5 file, group members are indexed using a structure called a “B-tree.” This
isn't a computer science text, so we won't spend too long on the subject, but it's valuable
to have a rough understanding of what's going on behind the scenes, especially if you're
dealing with groups that have thousands or hundreds of thousands of items.
“B-trees” are data structures that are great for keeping track of large numbers of items,
while still making retrieval (and addition) of items fast. They work by taking a collection