More About Types - Python and HDF5


u'Accent: \xe9'
>>> print dset[1]
Accent: é

When you create this kind of dataset, the underlying HDF5 character set is set to “UTF-8.” The only disadvantage is that some older third-party applications, like IDL, may not be able to read your strings. If compatibility with legacy code like this is essential for your application, make sure you test!

Remember that the default string type on Python 3, str, is actually a Unicode string. So on Python 3, h5py.special_dtype(vlen=str) will give you a UTF-8 dataset, not the compatible-with-everything ASCII dataset. Use vlen=bytes instead to get an ASCII dataset.
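
The difference between the two dtypes can be checked directly. The following is a minimal sketch, assuming Python 3 with h5py installed; the file and dataset names are placeholders chosen for illustration, and the file lives only in memory.

```python
import h5py

# In-memory file (core driver, no backing store), so nothing is written to disk.
with h5py.File("strings_demo.h5", "w", driver="core", backing_store=False) as f:
    text_dt = h5py.special_dtype(vlen=str)     # on Python 3: variable-length UTF-8
    bytes_dt = h5py.special_dtype(vlen=bytes)  # variable-length ASCII

    text_ds = f.create_dataset("text", (2,), dtype=text_dt)
    bytes_ds = f.create_dataset("raw", (2,), dtype=bytes_dt)

    text_ds[0] = u"Accent: \xe9"
    bytes_ds[0] = b"Hello"

    # check_dtype reports the Python base type behind each dataset's dtype.
    text_base = h5py.check_dtype(vlen=text_ds.dtype)
    bytes_base = h5py.check_dtype(vlen=bytes_ds.dtype)

    # The low-level type object exposes the HDF5 character set.
    text_cset = text_ds.id.get_type().get_cset()
    bytes_cset = bytes_ds.id.get_type().get_cset()
```

Comparing text_cset and bytes_cset against h5py.h5t.CSET_UTF8 and h5py.h5t.CSET_ASCII shows which character set HDF5 recorded for each dataset.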

Don't Store Binary Data in Strings!

Finally, note that HDF5 will allow you to store raw binary data using the “ASCII” dataset dtype created with special_dtype(vlen=bytes). This may work, but is generally considered evil. And because of how the strings are handled internally, if your binary string has NULLs in it ("\x00"), it will be silently truncated!

The best way to store raw binary data is with the “opaque” type (see “Opaque Types” on page 98).
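
As a rough illustration of the opaque route, the sketch below wraps a byte string in NumPy's void type before handing it to h5py. It assumes h5py and NumPy are installed; the file name, dataset name, and sample bytes are made up for the example.

```python
import h5py
import numpy as np

blob = b"Hello\x00Hello\x00"  # raw bytes with embedded NULLs

# In-memory file (core driver, no backing store), so nothing is written to disk.
with h5py.File("opaque_demo.h5", "w", driver="core", backing_store=False) as f:
    # np.void wraps the bytes as a fixed-size opaque value, not a string.
    dset = f.create_dataset("blob", data=np.void(blob))

    stored_kind = dset.dtype.kind    # 'V' marks a NumPy void (opaque) dtype
    recovered = dset[()].tobytes()   # read the scalar back as plain bytes
```

Because the value never passes through the string machinery, the embedded NULLs come back intact in recovered.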

Future-Proofing Your Python 2 Application

Finally, here are some simple rules you can follow to keep the bytes/Unicode mess from driving you mad. They will also help you when porting to Python 3, using the context-free translation tool 2to3 that ships with Python.

1. Keep the text-versus-bytes distinction clear in your mind, and cleanly separate the two in code.

2. Always use the alias bytes instead of str when you're sure you want a byte string. For literals, you can even use the “b” prefix, for example, b"Hello". In particular, when calling special_dtype to create a byte string, always use bytes.

3. For text strings use str, or better yet, unicode. Unicode literals are entered with a leading “u”: u"Hello".
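
The rules above can be sketched in a few lines that run unchanged on Python 2.6+ and Python 3.3+. The helper to_bytes is hypothetical, introduced only to show one way of keeping conversions at a single boundary.

```python
text = u"Accent: \xe9"   # text: Unicode literal with a leading "u"
raw = b"Hello"           # bytes: literal with a leading "b"

# Convert explicitly, and only at the boundary between text and bytes.
encoded = text.encode("utf-8")
decoded = encoded.decode("utf-8")


def to_bytes(value, encoding="utf-8"):
    """Hypothetical helper: return value as a byte string.

    Byte strings pass through untouched; text is encoded explicitly.
    """
    if isinstance(value, bytes):
        return value
    return value.encode(encoding)
```

Funneling every conversion through one helper like this makes it easier to see where text becomes bytes, which is the distinction rule 1 asks you to keep clear.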

Compound Types

For some kinds of data, it makes sense to bundle closely related values together into a single element. The classic example is a C struct: multiple pieces of data that are handled
