However, it's common practice to store these data points on disk as single-precision, 4-byte numbers (float32), saving a factor of 2 in file size.
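To see where that factor of 2 comes from, compare the in-memory footprints of the two types with NumPy's nbytes attribute (a quick aside; the array shape matches the example below):
>>> import numpy as np
>>> np.ones((100, 1000)).nbytes                     # float64: 8 bytes per element
800000
>>> np.ones((100, 1000), dtype=np.float32).nbytes   # float32: 4 bytes per element
400000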
Let's suppose we have such a NumPy array called bigdata:
>>> bigdata = np.ones((100, 1000))
>>> bigdata.dtype
dtype('float64')
>>> bigdata.shape
(100, 1000)
We could store this in a file by simple assignment, resulting in a double-precision dataset:
>>> import h5py
>>> with h5py.File('big1.hdf5', 'w') as f1:
...     f1['big'] = bigdata
$ ls -lh big1.hdf5
-rw-r--r-- 1 computer computer 784K Apr 13 14:40 big1.hdf5
Or we could request that HDF5 store it as single-precision data:
>>> with h5py.File('big2.hdf5', 'w') as f2:
...     f2.create_dataset('big', data=bigdata, dtype=np.float32)
$ ls -lh big2.hdf5
-rw-r--r-- 1 computer computer 393K Apr 13 14:42 big2.hdf5
Keep in mind that whichever one you choose, your data will emerge from the file in that
format:
>>> f1 = h5py.File("big1.hdf5", "r")
>>> f2 = h5py.File("big2.hdf5", "r")
>>> f1['big'].dtype
dtype('float64')
>>> f2['big'].dtype
dtype('float32')
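If you want a different type when reading, recent versions of h5py (3.x) let you ask HDF5 to convert on the fly: Dataset.astype returns a wrapper that you can slice. A minimal sketch, reusing the f2 handle from above:
>>> out = f2['big'].astype('float64')[...]   # stored as float32; converted on read
>>> out.dtype
dtype('float64')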
Automatic Type Conversion and Direct Reads
But exactly how and when does the data get converted between the double-precision float64 in memory and the single-precision float32 in the file? This question is important for performance; after all, if you have a dataset that takes up 90% of the memory in your computer and you need to make a copy before storing it, there are going to be problems.
The HDF5 library itself handles type conversion, and does it on the fly when saving to or reading from a file. Nothing happens at the Python level; your array goes in, and the appropriate bytes come out on disk. There are built-in routines to convert between many source and destination formats, including between all flavors of floating-point and integer numbers available in NumPy.
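On the read side, one way to take advantage of this conversion machinery is a direct read into a preallocated array. A sketch using h5py's Dataset.read_direct (again reusing the f2 handle from above), which has HDF5 convert straight into your destination buffer rather than building an intermediate array at the Python level:
>>> out = np.empty((100, 1000), dtype=np.float64)   # preallocated destination buffer
>>> f2['big'].read_direct(out)   # HDF5 converts float32 -> float64 into out
>>> out.dtype
dtype('float64')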