Concurrency: Parallel HDF5, Threading, and Multiprocessing - Python and HDF5


your parallel processes can't share a single HDF5 file, even if the file is opened read-only.

This is a limitation of the HDF5 library. multiprocessing-based code is great for

long-running, CPU-bound problems in which data I/O is a relatively small component

handled by the master process. It's the simplest way in Python to write parallel code that

uses more than one core, and strongly recommended for general-purpose computation.

For anything else, MPI-based Parallel HDF5 is by far the best way to go. MPI is the

official “flavor” of parallelism supported by the HDF5 library. You can have an unlimited

number of processes, all of which share the same open HDF5 file. All of them can read

and write data, and modify the file's structure. Programs written this way require a little

more care, but it's the most elegant and highest-performance way to use HDF5 in a

parallel context.

By the way, there's even a fourth mechanism for parallel computing, which is becoming

increasingly popular. IPython, which you're probably already using as a more convenient

interpreter interface, has its own clustering capabilities designed around the

ZeroMQ networking system. It can even use MPI on the back end to improve the performance

of parallel code. At the moment, not much real-world use of h5py in the

IPython parallel model is known to the author. But you should definitely keep an eye

on it, if for no other reason than the very convenient interface that IPython provides

for clustering.

Threading

HDF5 has no native support for thread-level parallelism. While you can safely use HDF5

(and h5py) from threaded programs, there will be no performance advantage. But for

a lot of applications, that's not a problem. GUI apps, for example, spend most of their

time waiting for user input.

Before we start, it's important to review a few basics about threaded programming in

the Python world. Python itself includes a single master lock that governs access to the

interpreter's functions, called the Global Interpreter Lock or GIL. This lock serializes

access from multiple threads to basic resources like object reference counting. You can

have as many threads as you like in a Python program, but only one at a time can use

the interpreter.

This isn't such a big deal, particularly when writing programs like GUIs or web-based

applications that spend most of their time waiting for events. In these situations, the

GIL is “released” and other threads can use the interpreter while an I/O-bound thread

is waiting.
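
To make the point about waiting threads concrete, here is a minimal sketch using only the standard library. It assumes that time.sleep is a fair stand-in for a blocking wait such as user input or a socket read; the function names wait_for_event and run_waiters are made up for this illustration.

```python
import threading
import time


def wait_for_event(delay):
    # time.sleep stands in for a blocking wait (user input, a socket read).
    # Like those waits, it releases the GIL while the thread is idle.
    time.sleep(delay)


def run_waiters(n_threads, delay):
    """Start n_threads waiting threads and return the total wall-clock time."""
    threads = [threading.Thread(target=wait_for_event, args=(delay,))
               for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start


elapsed = run_waiters(n_threads=4, delay=0.2)
# The waits overlap, so the total is close to one delay, not four.
print(f"4 threads, 0.2 s each: {elapsed:.2f} s total")
```

If the GIL were held during each wait, the four threads would take about 0.8 seconds in total. Because it is released, the total should stay close to 0.2 seconds.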

h5py uses a similar concept. Access to HDF5 is serialized using locking, so only one

thread at a time can work with the library. Unlike other I/O mechanisms in Python, all

use of the HDF5 library is “blocking”; once a call is made to HDF5, the GIL is not released

until it completes.
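
The idea of serializing access with a lock can be sketched in plain Python. This is only an analogy under stated assumptions, not h5py's actual implementation: the names serialized, library_call, and _library_lock are hypothetical, and the "library" here is just a list append.

```python
import functools
import threading

# One process-wide reentrant lock guards every call into the "library".
_library_lock = threading.RLock()


def serialized(func):
    """Run func while holding the shared lock, so calls never overlap."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        with _library_lock:
            return func(*args, **kwargs)
    return wrapper


# Bookkeeping used only to observe the effect of the lock.
_active = 0
max_active = 0


@serialized
def library_call(log, item):
    """Hypothetical stand-in for a call into a non-thread-safe library."""
    global _active, max_active
    _active += 1
    max_active = max(max_active, _active)
    log.append(item)
    _active -= 1


log = []
threads = [threading.Thread(target=library_call, args=(log, i))
           for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(log), max_active)
```

All eight calls complete, and max_active never rises above 1, because each thread has to wait for the lock before it enters the guarded function. A reentrant lock is used in this sketch so that one guarded function could call another without deadlocking.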
