Databases Reference
In-Depth Information
Fixed-Length Strings
Strings in HDF5 are a bit of a pain; you got a taste of that in Chapter 6 .
As we'll see in the next section, most real-world strings don't fit neatly into a constant
amount of storage. But fixed-width strings have been around since the FORTRAN days
and fit nicely into the NumPy world.
In NumPy, these are generally created using the “S” dtype. This is a flexible dtype that
lets you set the string length when you create the type. HDF5 supports fixed-length
strings natively:
>>> dt = np . dtype ( "S10" ) # 10-character byte string
>>> dset = f . create_dataset ( 'fixed_string' , ( 100 ,), dtype = dt )
>>> dset [ 0 ] = "Hello"
>>> dset [ 0 ]
'Hello'
Like NumPy fixed-width strings, HDF5 will truncate strings that are too big:
>>> dset [ 0 ] = "thisstringhasmorethan10characters"
>>> dset [ 0 ]
'thisstring'
Technically, these are fixed-length byte strings, which means they use one byte per
character. In HDF5, they are assumed to store ASCII text only. NumPy also supports
fixed-width Unicode strings, which use multiple bytes to store each character and can
represent things outside the ASCII range. The NumPy dtype for this is kind “U,” as in
dtype("U10") .
Unfortunately, HDF5 does not support such “wide-character” Unicode strings, so
there's no way to directly store “U” strings in a file. However, you aren't out of luck on
the Unicode front. First, we'll have to take a detour and discuss one of the best features
in HDF5: variable-length strings.
Variable-Length Strings
If you've used NumPy for a while, you're used to one subtle but important aspect of its
design: all elements in an array have the same size . There are a lot of advantages to this
design; for example, to locate the 115th element of a dataset containing 4-byte floats,
you know to look 460 bytes from the beginning of the array. And most types you use in
everyday computation are of a fixed size—once you've chosen to work with double-
precision floats, for example, they're all 8 bytes wide.
This begins to break down when you come to string types. As we saw earlier, NumPy
natively includes two string types: one 8-bit “ASCII” string type and one 32-bit “Uni‐
code” string type. You have to explicitly set the size of the string when you create the
type. For example, let's create a length-3 ASCII string array and initialize it:
Search WWH ::




Custom Search