This is the kind of problem that is trivial to parallelize, since each computation doesn't depend on the others. First, we create our file, containing a single coords dataset:
import numpy as np
import h5py

with h5py.File('coords.hdf5', 'w') as f:
    dset = f.create_dataset('coords', (1000, 2), dtype='f4')
    dset[...] = np.random.random((1000, 2))
Our program will use a simple one-liner for the distance measurement, and a four-process Pool to carry out the 1,000 conversions required. Note that we don't have any files open when invoking map:
import numpy as np
from multiprocessing import Pool
import h5py

def distance(arr):
    """ Compute distance from origin to the point (arr is a shape-(2,) array)
    """
    return np.sqrt(np.sum(arr**2))

# Load data and close the input file
with h5py.File('coords.hdf5', 'r') as f:
    data = f['coords'][...]

# Create a 4-process pool
p = Pool(4)

# Carry out parallel computation
result = np.array(p.map(distance, data))

# Write the result into a new dataset in the file
with h5py.File('coords.hdf5', 'a') as f:
    f['distances'] = result
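One portability note not covered by the listing: on platforms where multiprocessing starts workers with the "spawn" method (Windows, and macOS on recent Python versions), each worker re-imports the module, so the Pool must be created under an if __name__ == '__main__': guard. A minimal rearrangement of the same program:

if __name__ == '__main__':
    # Load data and close the input file
    with h5py.File('coords.hdf5', 'r') as f:
        data = f['coords'][...]

    # Create a 4-process pool and carry out the parallel computation
    with Pool(4) as p:
        result = np.array(p.map(distance, data))

    # Write the result into a new dataset in the file
    with h5py.File('coords.hdf5', 'a') as f:
        f['distances'] = result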
Doing anything more complex with multiprocessing and HDF5 gets complicated. Your processes can't all access the same file. Either you do your I/O explicitly in the main process (as shown), or you have each process generate a bunch of smaller "shard" files and join them together when you're done:
import os
import numpy as np
from multiprocessing import Pool
import h5py

def distance_block(idx):
    """ Read a 100-element coordinates block, compute distances, and write
    back out again to a process-specific file.
    """
    with h5py.File('coords.hdf5', 'r') as f:
        data = f['coords'][idx:idx+100]
    result = np.sqrt(np.sum(data**2, axis=1))
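The listing breaks off before distance_block writes its result. A minimal sketch of how the rest of the pattern could look (the shard filenames, the 'distances' dataset name, and the merge loop here are assumptions for illustration, not the book's exact code): each call writes its 100 distances to its own shard file named 'result_<idx>.hdf5', the main process maps distance_block over the block start indices, and the shards are then copied back into coords.hdf5 and removed.

if __name__ == '__main__':
    # Fan out: one distance_block call per 100-element block of 'coords'
    # (assumes distance_block ends by writing result to 'result_%d.hdf5' % idx)
    with Pool(4) as p:
        p.map(distance_block, range(0, 1000, 100))

    # Fan in: join the shard files back into the main file, then delete them
    with h5py.File('coords.hdf5', 'a') as f:
        dset = f.create_dataset('distances', (1000,), dtype='f4')
        for idx in range(0, 1000, 100):
            fname = 'result_%d.hdf5' % idx
            with h5py.File(fname, 'r') as shard:
                dset[idx:idx+100] = shard['distances'][...]
            os.remove(fname)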