The Python 2 str type, used earlier, is more properly called a byte string in the Python
world. As the name implies, these are sequences of single-byte elements. They're available
on both Python 2 and 3 under the name bytes (it's a simple alias for str on Python
2, and a separate type on Python 3). They're intended to hold strictly binary strings,
although in the Python 2 world they play a dual role, generally representing ASCII or
Latin-1 encoded text.
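For instance, a quick interpreter check (a minimal illustration, assuming a standard Python 2 interpreter, and not one of the examples above) confirms that the two names refer to the same type:
>>> bytes is str             # on Python 2, bytes is just an alias for str
True
>>> type(b"binary data")
<type 'str'>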
In the HDF5 world, these represent “ASCII” strings. Although no checking is done, they
are expected to contain values in the range 0-127 and represent plain-ASCII text. When
you create a dataset on Python 2 using:
>>> h5py.special_dtype(vlen=str)
or the equivalent-but-more-readable:
>>> h5py.special_dtype(vlen=bytes)
the underlying dataset is created with an ASCII character set. Since there are many third-party
applications for HDF5 that understand only ASCII strings, this is by far the most
compatible configuration.
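As a rough sketch (the file and dataset names here are hypothetical, chosen only for illustration), creating and writing to such a byte-string dataset looks like this:
>>> import h5py
>>> f = h5py.File("strings_demo.hdf5")           # hypothetical file name
>>> dt = h5py.special_dtype(vlen=bytes)
>>> dset = f.create_dataset("vlen_ascii", (100,), dtype=dt)
>>> dset[0] = "Hello"                            # a plain byte string; stored as ASCII
>>> dset[0]
'Hello'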
Using Unicode Strings
The Python 2 unicode type properly represents “text” strings, in contrast to the str/
bytes “byte” strings just discussed. On Python 3, “byte” strings are called bytes and the
equivalent “text” strings are called—wait for it—str. Wonderful.
These strings hold sequences of more abstract Unicode characters. You're not supposed
to worry about how they're actually represented. Before you can store them somewhere,
you need to explicitly encode them, which means translating them into byte sequences.
The rules that translate these “text” strings into byte strings are called encodings. HDF5
uses the UTF-8 encoding, which is very space-efficient for strings that contain mainly
Western characters.
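For example (a minimal illustration of encoding in plain Python 2, independent of HDF5), UTF-8 maps a text string to a byte string and back:
>>> u"Accent: \u00E9".encode("utf-8")    # text string -> UTF-8 byte string
'Accent: \xc3\xa9'
>>> 'Accent: \xc3\xa9'.decode("utf-8")   # byte string -> text string
u'Accent: \xe9'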
You can actually store these “Unicode” or “text” strings directly in HDF5, by using a
similar “special” dtype:
>>> dt = h5py.special_dtype(vlen=unicode)
>>> dt
dtype(('|O4', [(({'type': <type 'unicode'>}, 'vlen'), '|O4')]))
Like before, you can create datasets and interact with them. But now you can use non-ASCII
characters:
>>> dset = f.create_dataset('vlen_unicode', (100,), dtype=dt)
>>> dset[0] = "Hello"
>>> dset[1] = u"Accent: \u00E9"
>>> dset[0]
u'Hello'
>>> dset[1]
u'Accent: \xe9'