your parallel processes can't share a single HDF5 file, even if the file is opened read-
only. This is a limitation of the HDF5 library. multiprocessing-based code is great for
long-running, CPU-bound problems in which data I/O is a relatively small component
handled by the master process. It's the simplest way in Python to write parallel code that
uses more than one core, and it is strongly recommended for general-purpose computation.
For anything else, MPI-based Parallel HDF5 is by far the best way to go. MPI is the
official “flavor” of parallelism supported by the HDF5 library. You can have an unlimited
number of processes, all of which share the same open HDF5 file. All of them can read
and write data, and modify the file's structure. Programs written this way require a little
more care, but this is the most elegant and highest-performance way to use HDF5 in a
parallel context.
By the way, there's even a fourth mechanism for parallel computing, which is becoming
increasingly popular. IPython, which you're probably already using as a more convenient
interpreter interface, has its own clustering capabilities designed around the
ZeroMQ networking system. It can even use MPI on the back end to improve the
performance of parallel code. At the moment, not much real-world use of h5py in the
IPython parallel model is known to the author. But you should definitely keep an eye
on it, if for no other reason than the very convenient interface that IPython provides
for clustering.
Threading
HDF5 has no native support for thread-level parallelism. While you can safely use HDF5
(and h5py) from threaded programs, there will be no performance advantage. But for
a lot of applications, that's not a problem. GUI apps, for example, spend most of their
time waiting for user input.
Before we start, it's important to review a few basics about threaded programming in
the Python world. Python itself includes a single master lock that governs access to the
interpreter's functions, called the Global Interpreter Lock or GIL. This lock serializes
access from multiple threads to basic resources like object reference counting. You can
have as many threads as you like in a Python program, but only one at a time can use
the interpreter.
This isn't such a big deal, particularly when writing programs like GUIs or web-based
applications that spend most of their time waiting for events. In these situations, the
GIL is “released” and other threads can use the interpreter while an I/O-bound thread
is waiting.
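To make the point about waiting threads concrete, here is a minimal sketch using only the standard library. The function names (wait_for_event, run_waiters) are made up for illustration, and time.sleep stands in for any blocking wait such as user input or a socket read. Because the GIL is released while each thread sleeps, four threads that each wait 0.2 seconds should finish in roughly 0.2 seconds total, not 0.8.

```python
import threading
import time


def wait_for_event(delay, results, index):
    # time.sleep is a stand-in for a blocking wait (user input, network, ...).
    # The GIL is released while this thread sleeps, so other threads can run.
    time.sleep(delay)
    results[index] = True


def run_waiters(n_threads=4, delay=0.2):
    """Start n_threads waiting threads; return (results, elapsed_seconds)."""
    results = [False] * n_threads
    threads = [
        threading.Thread(target=wait_for_event, args=(delay, results, i))
        for i in range(n_threads)
    ]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    elapsed = time.perf_counter() - start
    return results, elapsed


if __name__ == "__main__":
    results, elapsed = run_waiters()
    print("all threads finished:", all(results))
    print("elapsed: %.2f s" % elapsed)
```

If the waits were serialized by the GIL, the elapsed time would be close to the sum of the delays. Replacing the sleep with a pure-Python computation would show the opposite behavior, since only one thread at a time can execute Python code.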
h5py uses a similar concept. Access to HDF5 is serialized using locking, so only one
thread at a time can work with the library. Unlike other I/O mechanisms in Python, all
use of the HDF5 library is “blocking”; once a call is made to HDF5, the GIL is not released
until that call completes.
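The locking idea can be modeled in a few lines. What follows is only a sketch of the concept, not h5py's actual implementation: SerializedLibrary is a hypothetical class, and its call method stands in for any operation on the HDF5 library. One caveat is that the sleep used here releases the GIL, whereas a real HDF5 call, as described above, holds it until the call completes.

```python
import threading
import time


class SerializedLibrary:
    """Hypothetical stand-in for a library guarded by one global lock."""

    def __init__(self):
        self._lock = threading.RLock()
        self.active = 0       # threads currently inside the "library"
        self.max_active = 0   # highest value of self.active ever observed
        self.calls = 0        # total number of completed calls

    def call(self, duration=0.01):
        # Every entry point takes the same lock, so calls run one at a time.
        with self._lock:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
            time.sleep(duration)  # pretend to do some work
            self.calls += 1
            self.active -= 1


def run_threads(n_threads=8):
    """Call the library from n_threads threads and return it for inspection."""
    lib = SerializedLibrary()
    threads = [threading.Thread(target=lib.call) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return lib


if __name__ == "__main__":
    lib = run_threads()
    print("calls completed:", lib.calls)
    print("max threads inside the library at once:", lib.max_active)
```

However many threads are started, max_active never rises above 1. That is the sense in which threaded use of the library is safe but offers no performance advantage.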