Databases Reference
In-Depth Information
Fixed-Length Strings
Strings in HDF5 are a bit of a pain; you got a taste of that in
Chapter 6
.
As we'll see in the next section, most real-world strings don't fit neatly into a constant
amount of storage. But fixed-width strings have been around since the FORTRAN days
and fit nicely into the NumPy world.
In NumPy, these are generally created using the “S” dtype. This is a
flexible
dtype that
lets you set the string length when you create the type. HDF5 supports fixed-length
strings natively:
>>>
dt
=
np
.
dtype
(
"S10"
)
# 10-character byte string
>>>
dset
=
f
.
create_dataset
(
'fixed_string'
,
(
100
,),
dtype
=
dt
)
>>>
dset
[
0
]
=
"Hello"
>>>
dset
[
0
]
'Hello'
Like NumPy fixed-width strings, HDF5 will truncate strings that are too big:
>>>
dset
[
0
]
=
"thisstringhasmorethan10characters"
>>>
dset
[
0
]
'thisstring'
Technically, these are fixed-length
byte
strings, which means they use one byte per
character. In HDF5, they are assumed to store ASCII text only. NumPy also supports
fixed-width Unicode strings, which use multiple bytes to store each character and can
represent things outside the ASCII range. The NumPy dtype for this is kind “U,” as in
dtype("U10")
.
Unfortunately, HDF5 does not support such “wide-character” Unicode strings, so
there's no way to directly store “U” strings in a file. However, you aren't out of luck on
the Unicode front. First, we'll have to take a detour and discuss one of the best features
in HDF5: variable-length strings.
Variable-Length Strings
If you've used NumPy for a while, you're used to one subtle but important aspect of its
design: all elements in an array
have the same size
. There are a lot of advantages to this
design; for example, to locate the 115th element of a dataset containing 4-byte floats,
you know to look 460 bytes from the beginning of the array. And most types you use in
everyday computation are of a fixed size—once you've chosen to work with double-
precision floats, for example, they're all 8 bytes wide.
This begins to break down when you come to string types. As we saw earlier, NumPy
natively includes two string types: one 8-bit “ASCII” string type and one 32-bit “Uni‐
code” string type. You have to explicitly set the size of the string when you create the
type. For example, let's create a length-3 ASCII string array and initialize it: