Getting Started - Python and HDF5


The timeit function takes a command to execute (a string or a callable) and an optional

number of times it should be run. It then returns the total time, in seconds, spent running

the command. For example, if we execute the “wait” function time.sleep five times:

>>> import time

>>> from timeit import timeit

>>> timeit("time.sleep(0.1)", number=5)

0.49967818316434887
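Because timeit also accepts a callable, the same kind of measurement can be written without a string. The following is a minimal sketch, not code from the original text; the helper name wait and the 10 ms delay are illustrative assumptions:

```python
import time
from timeit import timeit


def wait():
    """Stand-in workload (hypothetical): sleep for roughly 10 ms."""
    time.sleep(0.01)


# Run the callable five times; timeit returns the total elapsed seconds.
total = timeit(wait, number=5)

print(total)      # total time for all five calls
print(total / 5)  # average time per call
```

Passing a callable avoids quoting issues and lets the timed code use names from the surrounding scope directly.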

If you're using IPython, there's a similar built-in “magic” function called %timeit that

runs the specified statement a few times, and reports the lowest execution time:

>>> %timeit time.sleep(0.1)

10 loops, best of 3: 100 ms per loop
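Outside IPython, a rough approximation of this "best of several trials" report can be assembled from the standard library's timeit.repeat. This is only a sketch: the trial and loop counts below are chosen by hand, whereas %timeit picks its own, and the output format is imitated rather than produced by IPython:

```python
from timeit import repeat

LOOPS = 10   # calls per trial (assumed value for illustration)
TRIALS = 3   # number of independent trials (assumed value)

# Each entry in `trials` is the total time for LOOPS calls.
trials = repeat("time.sleep(0.01)", setup="import time",
                repeat=TRIALS, number=LOOPS)

# The minimum is the measurement least disturbed by other system activity.
best_per_loop = min(trials) / LOOPS
print(f"{LOOPS} loops, best of {TRIALS}: {best_per_loop * 1e3:.1f} ms per loop")
```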

We'll stick with the regular timeit function in this topic, in part because it's provided

by the Python standard library.

Since people using HDF5 generally deal with large datasets, performance is always a

concern. But you'll notice that optimization and benchmarking discussions in this topic

don't go into great detail about things like cache hits, data conversion rates, and so forth.

The design of the h5py package, which this topic uses, leaves nearly all of that to HDF5.

This way, you benefit from the hundreds of person-years of work spent on tuning HDF5

to provide the highest performance possible.

As an application builder, the best thing you can do for performance is to use the API

in a sensible way and let HDF5 do its job. Here are some suggestions:

1. Don't optimize anything unless there's a demonstrated performance problem.

Then, carefully isolate the misbehaving parts of the code before changing anything.

2. Start with simple, straightforward code that takes advantage of the API features.

For example, to iterate over all objects in a file, use the Visitor feature of HDF5 (see

“Multilevel Iteration with the Visitor Pattern” on page 68) rather than cobbling together

your own approach.

3. Do “algorithmic” improvements first. For example, when writing to a dataset (see

Chapter 3), write data in reasonably sized blocks instead of point by point. This lets

HDF5 use the filesystem in an intelligent way.

4. Make sure you're using the right data types. For example, if you're running a

compute-intensive program that loads floating-point data from a file, make sure

that you're not accidentally using double-precision floats in a calculation where

single precision would do.

5. Finally, don't hesitate to ask for help on the h5py or NumPy/SciPy mailing lists,

Stack Overflow, or other community sites. Lots of people are using NumPy and

HDF5 these days, and many performance problems have known solutions. The

Python community is very welcoming.
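Suggestions 2 through 4 can be tried out together in a small experiment. The sketch below assumes h5py and NumPy are installed. The dataset length, the group and dataset names, and the two helper functions are all illustrative choices, not taken from the text. The file lives only in memory, so the timings show per-call overhead rather than real filesystem behavior:

```python
import timeit

import h5py
import numpy as np

N = 2000  # illustrative dataset length, chosen arbitrarily
data = np.arange(N, dtype="f4")  # single precision on purpose (suggestion 4)


def write_pointwise(dset, values):
    """One HDF5 write call per element: the pattern to avoid."""
    for i in range(len(values)):
        dset[i] = values[i]


def write_block(dset, values):
    """A single write call covering the whole dataset (suggestion 3)."""
    dset[...] = values


# driver="core" with backing_store=False keeps the file in memory only,
# so "demo.hdf5" is a placeholder name and nothing is written to disk.
with h5py.File("demo.hdf5", "w", driver="core", backing_store=False) as f:
    grp = f.create_group("results")
    slow = grp.create_dataset("pointwise", (N,), dtype="f4")
    fast = grp.create_dataset("block", (N,), dtype="f4")

    t_slow = timeit.timeit(lambda: write_pointwise(slow, data), number=1)
    t_fast = timeit.timeit(lambda: write_block(fast, data), number=1)

    # Suggestion 2: let HDF5 walk the file instead of recursing by hand.
    names = []
    f.visit(names.append)  # append returns None, so the visit continues

    loaded = fast[...]  # read the whole dataset back as a NumPy array
    identical = np.array_equal(slow[...], loaded)

print(f"point by point: {t_slow:.4f} s, one block: {t_fast:.4f} s")
print(sorted(names))
print(loaded.dtype, identical)
```

Both write strategies store the same values, so any difference in the printed timings comes from how many times the API is called. Checking loaded.dtype is a quick way to confirm that data read from the file has the precision you expect.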
