The Python 2 str type, used earlier, is more properly called a byte string in the Python
world. As the name implies, these are sequences of single-byte elements. They're available
on both Python 2 and 3 under the name bytes (it's a simple alias for str on Python
2, and a separate type on Python 3). They're intended to hold strictly binary strings,
although in the Python 2 world they play a dual role, generally representing ASCII or
Latin-1 encoded text.
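For instance, a quick interpreter check (a minimal illustration, assuming a standard Python 2 interpreter, and not one of the examples above) confirms that the two names refer to the same type:
>>> bytes is str             # on Python 2, bytes is just an alias for str
True
>>> type(b"binary data")
<type 'str'>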
In the HDF5 world, these represent “ASCII” strings. Although no checking is done, they
are expected to contain values in the range 0-127 and represent plain-ASCII text. When
you create a dataset on Python 2 using:
>>> h5py.special_dtype(vlen=str)
or the equivalent-but-more-readable:
>>> h5py.special_dtype(vlen=bytes)
the underlying dataset is created with an ASCII character set. Since there are many third-party
applications for HDF5 that understand only ASCII strings, this is by far the most
compatible configuration.
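As a rough sketch (the file and dataset names here are hypothetical, chosen only for illustration), creating and writing to such a byte-string dataset looks like this:
>>> import h5py
>>> f = h5py.File("strings_demo.hdf5")           # hypothetical file name
>>> dt = h5py.special_dtype(vlen=bytes)
>>> dset = f.create_dataset("vlen_ascii", (100,), dtype=dt)
>>> dset[0] = "Hello"                            # a plain byte string; stored as ASCII
>>> dset[0]
'Hello'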
Using Unicode Strings
The Python 2 unicode type properly represents “text” strings, in contrast to the str/
bytes “byte” strings just discussed. On Python 3, “byte” strings are called bytes and the
equivalent “text” strings are called—wait for it—str. Wonderful.
These strings hold sequences of more abstract Unicode characters. You're not supposed
to worry about how they're actually represented. Before you can store them somewhere,
you need to explicitly encode them, which means translating them into byte sequences.
The rules that translate these “text” strings into byte strings are called encodings. HDF5
uses the UTF-8 encoding, which is very space-efficient for strings that contain mainly
Western characters.
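For example (a minimal illustration of encoding in plain Python 2, independent of HDF5), UTF-8 maps a text string to a byte string and back:
>>> u"Accent: \u00E9".encode("utf-8")    # text string -> UTF-8 byte string
'Accent: \xc3\xa9'
>>> 'Accent: \xc3\xa9'.decode("utf-8")   # byte string -> text string
u'Accent: \xe9'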
You can actually store these “Unicode” or “text” strings directly in HDF5, by using a
similar “special” dtype:
>>> dt = h5py.special_dtype(vlen=unicode)
>>> dt
dtype(('|O4', [(({'type': <type 'unicode'>}, 'vlen'), '|O4')]))
Like before, you can create datasets and interact with them. But now you can use non-ASCII
characters:
>>> dset = f.create_dataset('vlen_unicode', (100,), dtype=dt)
>>> dset[0] = "Hello"
>>> dset[1] = u"Accent: \u00E9"
>>> dset[0]
u'Hello'
>>> dset[1]
u'Accent: \xe9'