>>> dt = np.dtype('S3')
>>> a = np.array(["a", "ab", "abc", "abcd"], dtype=dt)
>>> a
array(['a', 'ab', 'abc', 'abc'],
      dtype='|S3')
The limitation is obvious: elements with more than three characters are simply truncated, and the information is lost. It's tempting to simply increase the length of the string type, say to 100, or 256. But we end up wasting a lot of memory, and there's still no guarantee our guess will be large enough:
# Read first 5 lines from file
# Ed M. 4/3/12: Increased max line size from 100 to 256 per issue #344
# Ed M. 5/1/12: Increased to 1000 per issue #345
# Ed M. 6/2/12: Fixed.
# TODO: mysterious crashes with MemoryError when many threads running (#346)
a = np.empty((5,), dtype='S100000')
for idx in xrange(5):
    a[idx] = textfile.readline()
This isn't a problem in every application, of course. But there's no getting around the
fact that strings in real-world data can have virtually any length.
Fortunately, HDF5 has a mechanism to handle this: variable-length strings. Like native
Python strings (and strings in C), these can be any width that fits in memory. Here's
how to take advantage of them.
The vlen String Data Type
First, since NumPy doesn't support variable-length strings at all, we need to use a special
dtype provided by h5py:
>>> dt = h5py.special_dtype(vlen=str)
>>> dt
dtype(('|O4', [(({'type': <type 'str'>}, 'vlen'), '|O4')]))
That looks like a mess. But it's actually a standard NumPy dtype with some metadata
attached. In this case, the underlying type is the NumPy object dtype:
>>> dt.kind
'O'
NumPy arrays of kind "O" hold ordinary Python objects. So the dtype effectively says,
“This is an object array, which is intended to hold Python strings.”
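The object-array behavior described above can be sketched with plain NumPy, no h5py required (a minimal illustration, not from the book): an object array stores references to ordinary Python string objects, so elements of any length survive intact, unlike the fixed-width 'S3' array shown earlier.

```python
import numpy as np

# An object array holds references to Python objects, so strings of
# any length are stored without truncation.
a = np.array(["a", "ab", "abc", "abcd"], dtype=object)

print(a.dtype.kind)  # → O
print(a[3])          # → abcd  (the 4-character string survives intact)
```

Of course, a bare object array carries no hint about what kind of objects it holds; the extra metadata in the h5py special dtype is what tells HDF5 to map those objects to variable-length strings on disk.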