However, it's common practice to store these data points on disk as single-precision, 4-byte numbers (float32), saving a factor of 2 in file size.
Let's suppose we have such a NumPy array called bigdata:
>>> bigdata = np.ones((100, 1000))
>>> bigdata.dtype
dtype('float64')
>>> bigdata.shape
(100, 1000)
We could store this in a file by simple assignment, resulting in a double-precision dataset:
>>> with h5py.File('big1.hdf5', 'w') as f1:
...     f1['big'] = bigdata
$ ls -lh big1.hdf5
-rw-r--r-- 1 computer computer 784K Apr 13 14:40 big1.hdf5
Or we could request that HDF5 store it as single-precision data:
>>> with h5py.File('big2.hdf5', 'w') as f2:
...     f2.create_dataset('big', data=bigdata, dtype=np.float32)
$ ls -lh big2.hdf5
-rw-r--r-- 1 computer computer 393K Apr 13 14:42 big2.hdf5
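If you'd rather check the savings from Python than from the shell, a quick sketch (assuming the two files created above are still in the current directory) is to compare the file sizes directly; the single-precision file comes out at roughly half the size, with the small difference owing to HDF5 metadata overhead:
>>> import os
>>> os.path.getsize('big1.hdf5')   # roughly 100*1000*8 bytes of data, plus HDF5 overhead
>>> os.path.getsize('big2.hdf5')   # roughly 100*1000*4 bytes of data, plus HDF5 overhead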
Keep in mind that whichever one you choose, your data will emerge from the file in that
format:
>>> f1 = h5py.File("big1.hdf5")
>>> f2 = h5py.File("big2.hdf5")
>>> f1['big'].dtype
dtype('float64')
>>> f2['big'].dtype
dtype('float32')
Automatic Type Conversion and Direct Reads
But exactly how and when does the data get converted between the double-precision float64 in memory and the single-precision float32 in the file? This question is important for performance; after all, if you have a dataset that takes up 90% of the memory in your computer and you need to make a copy before storing it, there are going to be problems.
The HDF5 library itself handles type conversion, and does it on the fly when saving to or reading from a file. Nothing happens at the Python level; your array goes in, and the appropriate bytes come out on disk. There are built-in routines to convert between many source and destination formats, including between all flavors of floating-point and integer numbers available in NumPy.
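As a rough sketch of this (reusing the big1.hdf5 file and array shape from above, with a hypothetical destination array called out), one way to watch the conversion happen is a direct read with Dataset.read_direct, which fills a preallocated NumPy array and lets the HDF5 library cast from the file's float64 to the destination's float32 as it reads:
>>> out = np.empty((100, 1000), dtype=np.float32)   # preallocated single-precision destination
>>> with h5py.File('big1.hdf5', 'r') as f:
...     f['big'].read_direct(out)    # HDF5 casts float64 -> float32 during the read
>>> out.dtype
dtype('float32')
No full-size double-precision copy is built at the Python level here; the conversion happens inside the library as the data is read, which is exactly the behavior described above.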