This is the kind of problem that is trivial to parallelize, since each computation doesn't
depend on the others. First, we create our file, containing a single coords dataset:
import numpy as np
import h5py

with h5py.File('coords.hdf5', 'w') as f:
    dset = f.create_dataset('coords', (1000, 2), dtype='f4')
    dset[...] = np.random.random((1000, 2))
Our program will use a simple one-liner for the distance measurement, and a four-process Pool to carry out the 1,000 conversions required. Note that we don't have any files open when invoking map:
import numpy as np
from multiprocessing import Pool
import h5py

def distance(arr):
    """ Compute distance from origin to the point (arr is a shape-(2,) array) """
    return np.sqrt(np.sum(arr**2))

# Load data and close the input file
with h5py.File('coords.hdf5', 'r') as f:
    data = f['coords'][...]

# Create a 4-process pool
p = Pool(4)

# Carry out parallel computation
result = np.array(p.map(distance, data))

# Write the result into a new dataset, opening the existing file read/write
with h5py.File('coords.hdf5', 'a') as f:
    f['distances'] = result
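One practical caveat: on platforms where multiprocessing uses the "spawn" start method (Windows, and macOS on recent Python versions), worker processes re-import the script, so the Pool creation and the map call need to live under an if __name__ == '__main__': guard. A minimal sketch of the same program with that guard in place (the structure only, not the original listing):

import numpy as np
from multiprocessing import Pool
import h5py

def distance(arr):
    """ Compute distance from origin to the point (arr is a shape-(2,) array) """
    return np.sqrt(np.sum(arr**2))

if __name__ == '__main__':
    # Read the coordinates and close the file before any workers start
    with h5py.File('coords.hdf5', 'r') as f:
        data = f['coords'][...]

    # The Pool context manager shuts the workers down when the block exits
    with Pool(4) as p:
        result = np.array(p.map(distance, data))

    # Write the distances back, opening the existing file read/write
    with h5py.File('coords.hdf5', 'a') as f:
        f['distances'] = result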
Doing anything more complex with multiprocessing and HDF5 gets complicated.
Your processes can't all access the same file. Either you do your I/O explicitly in the main
process (as shown), or you have each process generate a bunch of smaller “shard” files
and join them together when you're done:
import os
import numpy as np
from multiprocessing import Pool
import h5py

def distance_block(idx):
    """ Read a 100-element coordinates block, compute distances, and write
    back out again to a process-specific file.
    """
    with h5py.File('coords.hdf5', 'r') as f:
        data = f['coords'][idx:idx + 100]
    result = np.sqrt(np.sum(data**2, axis=1))
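The body above is shown only up to the distance computation; the write-back that the docstring describes, and the joining step mentioned earlier, could be sketched roughly as follows. The shard filename pattern coords_shard_<idx>.hdf5, the merge_shards helper, and the driver code are illustrative assumptions, not part of the original listing:

import os
import numpy as np
from multiprocessing import Pool
import h5py

def distance_block(idx):
    """ Read a 100-element coordinates block, compute distances, and write
    back out again to a process-specific file.
    """
    with h5py.File('coords.hdf5', 'r') as f:
        data = f['coords'][idx:idx + 100]
    result = np.sqrt(np.sum(data**2, axis=1))
    # Hypothetical shard name: one file per 100-element block
    with h5py.File('coords_shard_%d.hdf5' % idx, 'w') as f:
        f['distances'] = result

def merge_shards():
    """ Hypothetical helper: join the shard files into one dataset """
    with h5py.File('coords.hdf5', 'a') as f:
        dset = f.create_dataset('distances', (1000,), dtype='f4')
        for idx in range(0, 1000, 100):
            fname = 'coords_shard_%d.hdf5' % idx
            with h5py.File(fname, 'r') as shard:
                dset[idx:idx + 100] = shard['distances'][...]
            os.remove(fname)

if __name__ == '__main__':
    # No HDF5 files are open in the parent when map() starts the workers
    with Pool(4) as p:
        p.map(distance_block, range(0, 1000, 100))
    merge_shards()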