Working with Datasets - Python and HDF5

Because Dataset objects are so similar to NumPy arrays, you may be
tempted to mix them in with computational code. This may work for
a while, but generally causes performance problems as the data is on
disk instead of in memory.
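
As a rough sketch of the difference, the snippet below sums a small dataset twice: once by indexing the Dataset element by element, and once after reading it into an in-memory NumPy array. The scratch file path and the dataset name "samples" are assumptions made up for this illustration, not names from the text.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "sketch.hdf5")

with h5py.File(path, "w") as f:
    dset = f.create_dataset("samples", data=np.arange(1000, dtype=np.float64))

    # Indexing the Dataset inside a loop goes through HDF5 on every access.
    slow_total = 0.0
    for i in range(dset.shape[0]):
        slow_total += dset[i]

    # Reading the dataset once yields an ordinary in-memory NumPy array,
    # which computational code can then use freely.
    in_memory = dset[...]
    fast_total = in_memory.sum()
```

Both totals are the same; only the second approach keeps the computation away from the file.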

Creating Empty Datasets

You don't need to have a NumPy array at the ready to create a dataset. The method
create_dataset on our File object can create empty datasets from a shape and type,
or even just a shape (in which case the type will be np.float32, native single-precision
float):

>>> dset = f.create_dataset("test1", (10, 10))
>>> dset
<HDF5 dataset "test1": shape (10, 10), type "<f4">
>>> dset = f.create_dataset("test2", (10, 10), dtype=np.complex64)
>>> dset
<HDF5 dataset "test2": shape (10, 10), type "<c8">

HDF5 is smart enough to only allocate as much space on disk as it actually needs to
store the data you write. Here's an example: suppose you want to create a 1D dataset
that can hold 4 gigabytes worth of data samples from a long-running experiment:

>>> dset = f.create_dataset("big dataset", (1024**3,), dtype=np.float32)

Now write some data to it. To be fair, we also ask HDF5 to flush its buffers and actually
write to disk:

>>> dset[0:1024] = np.arange(1024)
>>> f.flush()

Looking at the file size on disk:

$ ls -lh testfile.hdf5

-rw-r--r-- 1 computer computer 66K Mar 6 21:23 testfile.hdf5

Saving Space with Explicit Storage Types

When it comes to types, a few seconds of thought can save you a lot of disk space and
also reduce your I/O time. The create_dataset method can accept almost any NumPy
dtype for the underlying dataset, and crucially, it doesn't have to exactly match the type
of data you later write to the dataset.

Here's an example: one common use for HDF5 is to store numerical floating-point data
(for example, time series from digitizers, stock prices, or computer simulations), anywhere
it's necessary to represent “real-world” numbers that aren't integers.

Often, to keep the accuracy of calculations high, 8-byte double-precision numbers will
be used in memory (NumPy dtype float64), to minimize the effects of rounding error.
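
One way to picture the point about storage types is the following minimal sketch: the working array lives in memory as float64, while the dataset is created with dtype float32, so values are converted as they are written. The scratch file path, the dataset name "big", and the array shape are assumptions chosen for illustration.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "types_sketch.hdf5")

with h5py.File(path, "w") as f:
    # In-memory working data: NumPy defaults to 8-byte float64.
    bigdata = np.ones((100, 1000))

    # Request 4-byte single-precision storage for the dataset on disk.
    dset = f.create_dataset("big", shape=(100, 1000), dtype=np.float32)

    # The float64 values are converted to float32 as they are written.
    dset[...] = bigdata

    stored_dtype = dset.dtype
    read_back = dset[...]  # comes back as a float32 NumPy array
```

The stored values take half the space per element of the in-memory array, at the cost of single-precision accuracy on disk.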
