However, it's common practice to store these data points on disk as single-precision, 4-byte numbers (float32), saving a factor of 2 in file size.
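To see where that factor of 2 comes from, compare the in-memory footprints of the two types with NumPy's nbytes attribute (a quick aside; the array shape matches the example below):
>>> import numpy as np
>>> np.ones((100, 1000)).nbytes                     # float64: 8 bytes per element
800000
>>> np.ones((100, 1000), dtype=np.float32).nbytes   # float32: 4 bytes per element
400000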
Let's suppose we have such a NumPy array called bigdata:
>>> bigdata = np.ones((100, 1000))
>>> bigdata.dtype
dtype('float64')
>>> bigdata.shape
(100, 1000)
We could store this in a file by simple assignment, resulting in a double-precision dataset:
>>> import h5py
>>> with h5py.File('big1.hdf5', 'w') as f1:
...     f1['big'] = bigdata
$ ls -lh big1.hdf5
-rw-r--r-- 1 computer computer 784K Apr 13 14:40 big1.hdf5
Or we could request that HDF5 store it as single-precision data:
>>> with h5py.File('big2.hdf5', 'w') as f2:
...     f2.create_dataset('big', data=bigdata, dtype=np.float32)
$ ls -lh big2.hdf5
-rw-r--r-- 1 computer computer 393K Apr 13 14:42 big2.hdf5
Keep in mind that whichever one you choose, your data will emerge from the file in that
format:
>>> f1 = h5py.File("big1.hdf5", "r")
>>> f2 = h5py.File("big2.hdf5", "r")
>>> f1['big'].dtype
dtype('float64')
>>> f2['big'].dtype
dtype('float32')
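If you want a different type when reading, recent versions of h5py (3.x) let you ask HDF5 to convert on the fly: Dataset.astype returns a wrapper that you can slice. A minimal sketch, reusing the f2 handle from above:
>>> out = f2['big'].astype('float64')[...]   # stored as float32; converted on read
>>> out.dtype
dtype('float64')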
Automatic Type Conversion and Direct Reads
But exactly how and when does the data get converted between the double-precision float64 in memory and the single-precision float32 in the file? This question is important for performance; after all, if you have a dataset that takes up 90% of the memory in your computer and you need to make a copy before storing it, there are going to be problems.
The HDF5 library itself handles type conversion, and does it on the fly when saving to or reading from a file. Nothing happens at the Python level; your array goes in, and the appropriate bytes come out on disk. There are built-in routines to convert between many source and destination formats, including between all flavors of floating-point and integer numbers available in NumPy.
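On the read side, one way to take advantage of this conversion machinery is a direct read into a preallocated array. A sketch using h5py's Dataset.read_direct (again reusing the f2 handle from above), which has HDF5 convert straight into your destination buffer rather than building an intermediate array at the Python level:
>>> out = np.empty((100, 1000), dtype=np.float64)   # preallocated destination buffer
>>> f2['big'].read_direct(out)   # HDF5 converts float32 -> float64 into out
>>> out.dtype
dtype('float64')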