This is the kind of problem that is trivial to parallelize, since each computation doesn't depend on the others. First, we create our file, containing a single coords dataset:
import numpy as np
import h5py

with h5py.File('coords.hdf5', 'w') as f:
    dset = f.create_dataset('coords', (1000, 2), dtype='f4')
    dset[...] = np.random.random((1000, 2))
Our program will use a simple one-liner for the distance measurement, and a four-process Pool to carry out the 1,000 conversions required. Note that we don't have any files open when invoking map:
import numpy as np
from multiprocessing import Pool
import h5py

def distance(arr):
    """ Compute distance from origin to the point (arr is a shape-(2,) array)
    """
    return np.sqrt(np.sum(arr**2))

# Load data and close the input file
with h5py.File('coords.hdf5', 'r') as f:
    data = f['coords'][...]

# Create a 4-process pool
p = Pool(4)

# Carry out parallel computation
result = np.array(p.map(distance, data))

# Write the result into a new dataset in the file
with h5py.File('coords.hdf5', 'a') as f:
    f['distances'] = result
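One portability note not covered by the listing: on platforms where multiprocessing starts workers with the "spawn" method (Windows, and macOS on recent Python versions), each worker re-imports the module, so the Pool must be created under an if __name__ == '__main__': guard. A minimal rearrangement of the same program:

if __name__ == '__main__':
    # Load data and close the input file
    with h5py.File('coords.hdf5', 'r') as f:
        data = f['coords'][...]

    # Create a 4-process pool and carry out the parallel computation
    with Pool(4) as p:
        result = np.array(p.map(distance, data))

    # Write the result into a new dataset in the file
    with h5py.File('coords.hdf5', 'a') as f:
        f['distances'] = result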
Doing anything more complex with multiprocessing and HDF5 gets complicated. Your processes can't all access the same file. Either you do your I/O explicitly in the main process (as shown), or you have each process generate a bunch of smaller "shard" files and join them together when you're done:
import os
import numpy as np
from multiprocessing import Pool
import h5py

def distance_block(idx):
    """ Read a 100-element coordinates block, compute distances, and write
    back out again to a process-specific file.
    """
    with h5py.File('coords.hdf5', 'r') as f:
        data = f['coords'][idx:idx+100]
    result = np.sqrt(np.sum(data**2, axis=1))
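The listing breaks off before distance_block writes its result. A minimal sketch of how the rest of the pattern could look (the shard filenames, the 'distances' dataset name, and the merge loop here are assumptions for illustration, not the book's exact code): each call writes its 100 distances to its own shard file named 'result_<idx>.hdf5', the main process maps distance_block over the block start indices, and the shards are then copied back into coords.hdf5 and removed.

if __name__ == '__main__':
    # Fan out: one distance_block call per 100-element block of 'coords'
    # (assumes distance_block ends by writing result to 'result_%d.hdf5' % idx)
    with Pool(4) as p:
        p.map(distance_block, range(0, 1000, 100))

    # Fan in: join the shard files back into the main file, then delete them
    with h5py.File('coords.hdf5', 'a') as f:
        dset = f.create_dataset('distances', (1000,), dtype='f4')
        for idx in range(0, 1000, 100):
            fname = 'result_%d.hdf5' % idx
            with h5py.File(fname, 'r') as shard:
                dset[idx:idx+100] = shard['distances'][...]
            os.remove(fname)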