Depending on your version of h5py, you may see a different result when you print the dtype; the details of how the “special” data is attached vary. Don't depend on any specific implementation. Always use the special_dtype function and don't try to piece one together yourself.
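As a minimal sketch of that advice, the following builds the dtype with special_dtype and then asks h5py what it means through check_dtype, rather than reading the attached data directly. Passing vlen=str is an assumption about how the dt used in the examples was created, since that step lies outside this section.

```python
import h5py

# Build the "special" dtype through the public helper (assumed here: vlen=str).
dt = h5py.special_dtype(vlen=str)

# To NumPy this is an ordinary object dtype; the HDF5 hint is attached
# separately, in a way that differs between h5py versions.
print(dt.kind)

# Ask h5py to interpret the dtype instead of inspecting the attached data.
base_type = h5py.check_dtype(vlen=dt)
print(base_type)
```

Going through check_dtype keeps the code independent of how a particular h5py release stores the hint.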
Working with vlen String Datasets
You can use a “special” dtype to create an array in the normal fashion. Here we create a
100-element variable-length string dataset:
>>> dset = f.create_dataset('vlen_dataset', (100,), dtype=dt)
You can write strings into it from anything that looks “string-shaped,” including ordinary Python strings and fixed-length NumPy strings:
>>> dset[0] = "Hello"
>>> dset[1] = np.string_("Hello2")
>>> dset[3] = "X"*10000
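The session above assumes an open file f and an existing dt. A self-contained sketch of the same steps might look like the following. The file name, the in-memory "core" driver, and the as_text helper are illustrative choices, not part of the original example. The helper exists because some h5py versions return bytes rather than str when a variable-length string is read back.

```python
import h5py

def as_text(value):
    """Hypothetical helper: return text whether h5py gave us bytes or str."""
    return value.decode("utf-8") if isinstance(value, bytes) else value

dt = h5py.special_dtype(vlen=str)

# The "core" driver with backing_store=False keeps the file in memory only.
with h5py.File("vlen_write_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("vlen_dataset", (100,), dtype=dt)

    # Elements may have very different lengths; no padding or truncation occurs.
    dset[0] = "Hello"
    dset[1] = "Hello2"
    dset[3] = "X" * 10000

    first = as_text(dset[0])
    long_len = len(as_text(dset[3]))
    pair = [as_text(s) for s in dset[0:2]]

print(first, long_len, pair)
```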
Retrieving a single element, you get a Python string:
>>> out = dset[0]
>>> type(out)
str
Retrieving more than one, you get an object array full of Python strings:
>>> dset[0:2]
array(['Hello', 'Hello2'], dtype=object)
There's one caveat here: for technical reasons, the array returned has a plain-vanilla “object” dtype, not the fancy dtype we created from h5py.special_dtype:
>>> out = dset[0:1]
>>> out.dtype
dtype('object')
This is one of very few cases where dset[...].dtype != dset.dtype.
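One way to observe the difference is to pass both dtypes to check_dtype and compare what comes back. This is a hedged sketch: whether the hint survives on the returned array may depend on the h5py version, so the code prints the result instead of assuming it. Note also that NumPy's own dtype comparison may not take the attached hint into account, which is why the sketch avoids the == operator. The file name and in-memory driver are illustrative.

```python
import h5py

dt = h5py.special_dtype(vlen=str)

with h5py.File("vlen_dtype_demo.h5", "w", driver="core", backing_store=False) as f:
    dset = f.create_dataset("vlen_dataset", (100,), dtype=dt)
    dset[0] = "Hello"
    out = dset[0:1]

    # The dataset's own dtype carries the variable-length string hint.
    dataset_hint = h5py.check_dtype(vlen=dset.dtype)

    # The array read from it is an object array; this is None when the
    # hint was not carried over to the result.
    array_hint = h5py.check_dtype(vlen=out.dtype)

print(out.dtype.kind, dataset_hint, array_hint)
```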
Byte Versus Unicode Strings
The preceding examples, like the rest of this topic, are written assuming you are using
Python 2. However, in both Python 2 and 3 there exist two “flavors” of string you should
be aware of. They are stored in the file slightly differently, and this has implications for
both internationalized applications and data portability.
A complete discussion of the bytes/Unicode mess in Python is beyond the scope of this
topic. However, it's important to discuss how the two types interact with HDF5.