u'Accent: \xe9'
>>> print dset[1]
Accent: é
When you create this kind of a dataset, the underlying HDF5 character set is set to
“UTF-8.” The only disadvantage is that some older third-party applications, like IDL,
may not be able to read your strings. If compatibility with legacy code like this is essential
for your application, make sure you test!
Remember the default string on Python 3, str, is actually a Unicode string. So on
Python 3, h5py.special_dtype(vlen=str) will give you a UTF-8 dataset, not the
compatible-with-everything ASCII dataset. Use vlen=bytes instead to get an ASCII dataset.
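The str-versus-bytes distinction behind this note can be seen without HDF5 at all. The following is a minimal sketch for Python 3 using only built-in types; the variable names are illustrative and it does not call h5py.

```python
# Python 3: str holds Unicode text, bytes holds raw byte values.
text = "Accent: \xe9"               # same text as u'Accent: \xe9' on Python 2

# UTF-8 can represent the accented character; it takes two bytes.
utf8_bytes = text.encode("utf-8")

# ASCII cannot represent it, so a strict ASCII encode raises an error.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(type(text).__name__, type(utf8_bytes).__name__)
print(ascii_ok)
```

Text that must fit in an ASCII dataset has to survive the second encode, which is why anything containing accented characters calls for the UTF-8 variant.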
Don't Store Binary Data in Strings!
Finally, note that HDF5 will allow you to store raw binary data using the “ASCII” dataset
dtype created with special_dtype(vlen=bytes). This may work, but is generally considered
evil. And because of how the strings are handled internally, if your binary string has
NULLs in it ("\x00"), it will be silently truncated!
The best way to store raw binary data is with the “opaque” type (see “Opaque Types” on
page 98).
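To picture the truncation hazard, here is a small analogy built on the standard ctypes module. It is not h5py's own code path; it assumes the loss comes from treating the data as a NUL-terminated C string, and the payload is a made-up example.

```python
import ctypes

# Hypothetical binary payload with an embedded NUL byte.
blob = b"\x89PNG\x00\x01\x02"

# Copy the payload into a C character buffer.
buf = ctypes.create_string_buffer(blob)

# Reading it back as a C string stops at the first NUL, with no error raised.
as_c_string = buf.value

# The full payload is still in memory; only the string view lost it.
full = buf.raw[:len(blob)]

print(len(blob), len(as_c_string), len(full))
```

Nothing signals the loss, which is what makes storing binary data in a string type risky.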
Future-Proofing Your Python 2 Application
Finally, here are some simple rules you can follow to keep the bytes/Unicode mess from
driving you mad. They will also help you when porting to Python 3, using the
context-free translation tool 2to3 that ships with Python.
1. Keep the text-versus-bytes distinction clear in your mind, and cleanly separate the
two in code.
2. Always use the alias bytes instead of str when you're sure you want a byte string.
For literals, you can even use the “b” prefix, for example, b"Hello". In particular,
when calling special_dtype to create a byte string, always use bytes.
3. For text strings use str, or better yet, unicode. Unicode literals are entered with a
leading “u”: u"Hello".
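The rules above can be sketched as a pair of small helpers that convert only at the edges of a program. The function names are hypothetical, and the choice of UTF-8 is an assumption made for illustration. The code is written to run unchanged on Python 2.6+ and Python 3.3+.

```python
def decode_label(raw):
    """Bytes in, text out: decode once, at the input boundary."""
    if not isinstance(raw, bytes):
        raise TypeError("expected a byte string")
    return raw.decode("utf-8")


def encode_label(text):
    """Text in, bytes out: encode once, at the output boundary."""
    return text.encode("utf-8")


raw_in = b"Accent: \xc3\xa9"        # byte literal, marked with the "b" prefix
label = decode_label(raw_in)        # text from here on
shouted = label.upper()             # text operations handle the accent correctly
raw_out = encode_label(shouted)     # back to bytes only when leaving the program

print(repr(raw_out))
```

Everything between the two helpers works on text only, so the bytes and text types never mix in the middle of the program.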
Compound Types
For some kinds of data, it makes sense to bundle closely related values together into a
single element. The classic example is a C struct: multiple pieces of data that are handled