with h5py.File('result_index_%d.hdf5' % idx, 'w') as f:
    f['result'] = result
# Create our pool and carry out the computation
p = Pool(4)
p.map(distance_block, xrange(0, 1000, 100))
with h5py.File('coords.hdf5') as f:
    dset = f.create_dataset('distances', (1000,), dtype='f4')
    # Loop over our 100-element "chunks" and merge the data into coords.hdf5
    for idx in xrange(0, 1000, 100):
        filename = 'result_index_%d.hdf5' % idx
        with h5py.File(filename, 'r') as f2:
            data = f2['result'][...]
        dset[idx:idx+100] = data
        os.unlink(filename)  # no longer needed
That looks positively exhausting, mainly because of the limitations on passing open files
to child processes. What if there were a way to share a single file between processes,
automatically synchronizing reads and writes? It turns out there is: Parallel HDF5.
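The limitation is easy to verify directly: Pool.map pickles every argument it sends to a worker process, and open file objects cannot be pickled. A quick stdlib-only check (using os.devnull so no real file is created):

```python
import os
import pickle

f = open(os.devnull, 'w')
try:
    pickle.dumps(f)    # Pool.map pickles arguments before sending them
    sharable = True
except TypeError:      # open file objects refuse to be pickled
    sharable = False
f.close()

print(sharable)  # False: the handle can't cross the process boundary
```

This is why each worker above receives a bare integer index and opens its own files by name.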
MPI and Parallel HDF5
Figure 9-3 shows how an application works using Parallel HDF5, in contrast to the threading and multiprocessing approaches earlier. MPI-based applications work by launching multiple parallel instances of the Python interpreter. Those instances communicate with each other via the MPI library. The key difference compared to multiprocessing is that the processes are peers, unlike the child processes used for the Pool objects we saw earlier.