Databases Reference
In-Depth Information
Because Dataset objects are so similar to NumPy arrays, you may be
tempted to mix them in with computational code. This may work for
a while, but generally causes performance problems as the data is on
disk instead of in memory.
Creating Empty Datasets
You don't need to have a NumPy array at the ready to create a dataset. The method
create_dataset on our File object can create empty datasets from a shape and type,
or even just a shape (in which case the type will be np.float32 , native single-precision
float):
>>> dset = f . create_dataset ( "test1" , ( 10 , 10 ))
>>> dset
<HDF5 dataset "test1": shape (10, 10), type "<f4">
>>> dset = f . create_dataset ( "test2" , ( 10 , 10 ), dtype = np . complex64 )
>>> dset
<HDF5 dataset "test2": shape (10, 10), type "<c8">
HDF5 is smart enough to only allocate as much space on disk as it actually needs to
store the data you write. Here's an example: suppose you want to create a 1D dataset
that can hold 4 gigabytes worth of data samples from a long-running experiment:
>>> dset = f . create_dataset ( "big dataset" , ( 1024 ** 3 ,), dtype = np . float32 )
Now write some data to it. To be fair, we also ask HDF5 to flush its buffers and actually
write to disk:
>>> dset [ 0 : 1024 ] = np . arange ( 1024 )
>>> f . flush ()
Looking at the file size on disk:
$ ls -lh testfile.hdf5
-rw-r--r-- 1 computer computer 66K Mar 6 21:23 testfile.hdf5
Saving Space with Explicit Storage Types
When it comes to types, a few seconds of thought can save you a lot of disk space and
also reduce your I/O time. The create_dataset method can accept almost any NumPy
dtype for the underlying dataset, and crucially, it doesn't have to exactly match the type
of data you later write to the dataset.
Here's an example: one common use for HDF5 is to store numerical floating-point data
—for example, time series from digitizers, stock prices, computer simulations—any‐
where it's necessary to represent “real-world” numbers that aren't integers.
Often, to keep the accuracy of calculations high, 8-byte double-precision numbers will
be used in memory (NumPy dtype float64 ), to minimize the effects of rounding error.
Search WWH ::




Custom Search