The timeit function takes a command to execute (a string or a callable) and an optional
number of times it should be run. It then returns the total time, in seconds, spent
running the command. For example, if we execute the “wait” function time.sleep five times:
>>> import time
>>> from timeit import timeit
>>> timeit("time.sleep(0.1)", number=5)
0.49967818316434887
If you're using IPython, there's a similar built-in “magic” function called %timeit that
runs the specified statement a few times, and reports the lowest execution time:
>>> %timeit time.sleep(0.1)
10 loops, best of 3: 100 ms per loop
We'll stick with the regular timeit function in this topic, in part because it's provided
by the Python standard library.
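Because timeit accepts a callable as well as a string, a function can be timed without quoting it. The following is a minimal sketch; the helper name wait and the 0.01-second delay are invented for illustration and are not taken from the text above.

```python
import time
from timeit import timeit

def wait():
    """Hypothetical stand-in for the code being measured."""
    time.sleep(0.01)

# Pass the function object itself; timeit calls it `number` times
# and returns the total elapsed time in seconds as a float.
runs = 5
total = timeit(wait, number=runs)
print("total: %.3f s, per call: %.3f s" % (total, total / runs))
```

Dividing the total by the number of runs gives a rough per-call figure, which is usually easier to compare between alternatives than the raw total.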
Since people using HDF5 generally deal with large datasets, performance is always a
concern. But you'll notice that optimization and benchmarking discussions in this topic
don't go into great detail about things like cache hits, data conversion rates, and so forth.
The design of the h5py package, which this topic uses, leaves nearly all of that to HDF5.
This way, you benefit from the hundreds of man-years of work spent on tuning HDF5
to provide the highest performance possible.
As an application builder, the best thing you can do for performance is to use the API
in a sensible way and let HDF5 do its job. Here are some suggestions:
1. Don't optimize anything unless there's a demonstrated performance problem.
Then, carefully isolate the misbehaving parts of the code before changing anything.
2. Start with simple, straightforward code that takes advantage of the API features.
For example, to iterate over all objects in a file, use the Visitor feature of HDF5 (see
“Multilevel Iteration with the Visitor Pattern” on page 68) rather than cobbling
together your own approach.
3. Do “algorithmic” improvements first. For example, when writing to a dataset (see
Chapter 3), write data in reasonably sized blocks instead of point by point. This lets
HDF5 use the filesystem in an intelligent way.
4. Make sure you're using the right data types. For example, if you're running a
compute-intensive program that loads floating-point data from a file, make sure
that you're not accidentally using double-precision floats in a calculation where
single precision would do.
5. Finally, don't hesitate to ask for help on the h5py or NumPy/SciPy mailing lists,
Stack Overflow, or other community sites. Lots of people are using NumPy and
HDF5 these days, and many performance problems have known solutions. The
Python community is very welcoming.
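The third suggestion, writing in blocks rather than point by point, can be checked with timeit before anything is changed. The following is only a rough sketch under stated assumptions: it requires the third-party numpy and h5py packages, and the file name sketch.hdf5, the dataset names, the helper functions, and the size N are all made up for illustration. Actual timings depend on the machine and the filesystem.

```python
import os
import tempfile
from timeit import timeit

import numpy as np
import h5py

N = 2000                          # illustrative size, not from the text
data = np.arange(N, dtype="f4")   # single-precision source data

def write_point_by_point(dset):
    # One HDF5 write call per element.
    for i in range(N):
        dset[i] = data[i]

def write_in_one_block(dset):
    # A single HDF5 write call for the whole array.
    dset[:] = data

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sketch.hdf5")   # hypothetical file name
    with h5py.File(path, "w") as f:
        point = f.create_dataset("point", (N,), dtype="f4")
        block = f.create_dataset("block", (N,), dtype="f4")

        t_point = timeit(lambda: write_point_by_point(point), number=1)
        t_block = timeit(lambda: write_in_one_block(block), number=1)

        # Both approaches should store identical values.
        same = np.array_equal(point[:], block[:])

print("point by point: %.4f s" % t_point)
print("one block:      %.4f s" % t_block)
```

Both datasets end up with the same contents; only the number of write calls differs, which is the kind of “algorithmic” difference worth measuring before reaching for lower-level tuning.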