There's no real performance difference when using (100, 50)-shape arrays, but what
about (10000, 10000)-shape arrays?
Let's check the real-world performance of this. We'll create a test dataset and two
functions. To keep things simple and isolate just the performance difference related to the
status of out, we'll always read the same selection of the dataset:
dset = f.create_dataset('perftest', (10000, 10000), dtype=np.float32)
dset[:] = np.random.random(10000)  # note the use of broadcasting!

def time_simple():
    dset[:, 0:500].mean(axis=1)

out = np.empty((10000, 500), dtype=np.float32)

def time_direct():
    dset.read_direct(out, np.s_[:, 0:500])
    out.mean(axis=1)
Now we'll see what effect preserving the out array has, if we were to put the read in a
for loop with 100 iterations:
>>> timeit(time_simple, number=100)
14.04414987564087
>>> timeit(time_direct, number=100)
12.045655965805054
Not too bad. The difference is 2 seconds, or about a 14% improvement. Of course, as
with all optimizations, it's up to you how far you want to go. This “simple” approach is
certainly more legible. But when performing multiple reads of data with the same shape,
particularly with larger arrays, it's hard to beat read_direct.
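As a rough, self-contained sketch of the comparison above, the following uses a much smaller dataset in an in-memory file so it runs quickly. The file name, the (1000, 1000) shape, and the 50-column selection are illustrative assumptions, not the values used for the timings in the text. It only checks that both approaches return the same numbers; it does not reproduce the benchmark.

```python
import numpy as np
import h5py

# In-memory HDF5 file; the name is hypothetical and nothing is written to disk.
f = h5py.File("read_direct_demo.h5", "w", driver="core", backing_store=False)
dset = f.create_dataset("perftest", (1000, 1000), dtype=np.float32)
dset[:] = np.random.random(1000)  # one row of values, broadcast to every row

# "Simple" approach: slicing allocates a new NumPy array on every read.
sliced_mean = dset[:, 0:50].mean(axis=1)

# read_direct approach: fill a preallocated array that can be reused.
out = np.empty((1000, 50), dtype=np.float32)
dset.read_direct(out, np.s_[:, 0:50])
direct_mean = out.mean(axis=1)

# Both paths read the same selection, so the results should agree.
same = bool(np.allclose(sliced_mean, direct_mean))

f.close()
```

The out array is created once, outside any loop, which is the point of the technique: repeated reads of the same shape reuse the buffer instead of allocating a new one each time.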
For historical reasons, there also exists a write_direct method. It does
the same in reverse; however, in modern versions of h5py it's no more
efficient than regular slicing assignment. You're welcome to use it if
you want, but there's no performance advantage.
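For illustration, here is a minimal sketch showing that write_direct and slicing assignment leave the same data in a dataset. The file name, dataset names, and shapes are made up for this example; it demonstrates equivalence of the results only and makes no claim about timing.

```python
import numpy as np
import h5py

# In-memory HDF5 file; the name is hypothetical and nothing is written to disk.
f = h5py.File("write_direct_demo.h5", "w", driver="core", backing_store=False)
a = f.create_dataset("via_slicing", (100, 50), dtype=np.float32)
b = f.create_dataset("via_write_direct", (100, 50), dtype=np.float32)

data = np.random.random((100, 50)).astype(np.float32)

# Regular slicing assignment into the first 25 columns.
a[:, 0:25] = data[:, 0:25]

# write_direct: select the region of the source array, then the
# destination region of the dataset.
b.write_direct(data, np.s_[:, 0:25], np.s_[:, 0:25])

# Both datasets should now hold identical contents.
identical = bool(np.array_equal(a[...], b[...]))

f.close()
```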
A Note on Data Types
HDF5 is designed to preserve data in any format you want. Occasionally, this means
you may get a file whose contents differ from the most convenient format for processing
on your system. One example we discussed before is endianness, which relates to how
multibyte numbers are represented. You can store a 4-byte floating-point number, for
example, in memory with the least significant byte first (little-endian), or with the most
significant byte first (big-endian). Modern Intel-style x86 chips use the little-endian
format, but data can be stored in HDF5 in either fashion.
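To make the two byte orders concrete, here is a small sketch using only Python's standard struct module (not HDF5 itself). It packs the same 4-byte float both ways; the value 1.0 is an arbitrary choice for illustration.

```python
import struct
import sys

value = 1.0  # IEEE 754 single precision: 0x3F800000

little = struct.pack("<f", value)  # least significant byte first
big = struct.pack(">f", value)     # most significant byte first

# The two encodings hold the same four bytes in opposite order.
reversed_match = little == big[::-1]

# Decoding each with the matching byte order recovers the same number.
from_little = struct.unpack("<f", little)[0]
from_big = struct.unpack(">f", big)[0]

# Byte order of the machine running this code: 'little' or 'big'.
native_order = sys.byteorder
```

On an x86 machine, native_order would be 'little', matching the description above.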