>>> dt = np.dtype('S3')
>>> a = np.array(["a", "ab", "abc", "abcd"], dtype=dt)
>>> a
array(['a', 'ab', 'abc', 'abc'],
      dtype='|S3')
The limitation is obvious: elements with more than three characters are simply truncated, and the information is lost. It's tempting to simply increase the length of the string type, say to 100, or 256. But we end up wasting a lot of memory, and there's still no guarantee our guess will be large enough:
# Read first 5 lines from file
# Ed M. 4/3/12: Increased max line size from 100 to 256 per issue #344
# Ed M. 5/1/12: Increased to 1000 per issue #345
# Ed M. 6/2/12: Fixed.
# TODO: mysterious crashes with MemoryError when many threads running (#346)
a = np.empty((5,), dtype='S100000')
for idx in xrange(5):
    a[idx] = textfile.readline()
This isn't a problem in every application, of course. But there's no getting around the
fact that strings in real-world data can have virtually any length.
Fortunately, HDF5 has a mechanism to handle this: variable-length strings. Like native
Python strings (and strings in C), these can be any width that fits in memory. Here's
how to take advantage of them.
The vlen String Data Type
First, since NumPy doesn't support variable-length strings at all, we need to use a special
dtype provided by h5py:
>>> dt = h5py.special_dtype(vlen=str)
>>> dt
dtype(('|O4', [(({'type': <type 'str'>}, 'vlen'), '|O4')]))
That looks like a mess. But it's actually a standard NumPy dtype with some metadata
attached. In this case, the underlying type is the NumPy object dtype:
>>> dt.kind
'O'
NumPy arrays of kind "O" hold ordinary Python objects. So the dtype effectively says,
“This is an object array, which is intended to hold Python strings.”
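The object-array behavior described above can be sketched with plain NumPy, no h5py required (a minimal illustration, not from the book): an object array stores references to ordinary Python string objects, so elements of any length survive intact, unlike the fixed-width 'S3' array shown earlier.

```python
import numpy as np

# An object array holds references to Python objects, so strings of
# any length are stored without truncation.
a = np.array(["a", "ab", "abc", "abcd"], dtype=object)

print(a.dtype.kind)  # → O
print(a[3])          # → abcd  (the 4-character string survives intact)
```

Of course, a bare object array carries no hint about what kind of objects it holds; the extra metadata in the h5py special dtype is what tells HDF5 to map those objects to variable-length strings on disk.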