4.2. Parallelization of the Co-Volume Algorithm Using MPI
There are two main goals for parallelization of a program: to handle huge
amounts of data that cannot be placed into the memory of one single serial com-
puter, and to run the program faster. Let us suppose that in terms of running time,
a fraction p of a program can be parallelized. In an ideal situation, executing the
parallelized program on n processors, the running time will be (1 − p) + p/n of the
serial running time. We can see that if, e.g., only 80% of the program can be
parallelized, i.e., p = 0.8, the speed-up, equal to 1/((1 − p) + p/n) and thus estimated
from above by 1/(1 − p), cannot exceed 5, even with infinitely many processors.
This illustrative example shows that it is very
important to identify the fraction of the program that can be parallelized and maxi-
mize it. Fortunately, every time-consuming part of our algorithm can be efficiently
parallelized, either directly (reading and writing data, computing the coefficients of
the linear system) or by replacing the serial linear solver (the SOR method) with a
solver that parallelizes efficiently, e.g., the RED-BLACK SOR method. The next
important issue in parallelization is to balance the workload of the parallel
processes. This problem can be addressed by splitting the data as uniformly as
possible, so that every process performs approximately the same number
of operations. The final and very important step is to minimize the time spent for
communication. This leads, e.g., to a requirement that the data transmitted (e.g.,
multidimensional arrays) be contiguous in memory, so that one can exchange it
among processors directly in one message, using a single call of the MPI send and
receive subroutines. For parallel algorithms implemented in the C language this means
that multidimensional arrays should be distributed in row-wise blocks; better said, we
have to split a multidimensional array such as u[i][j][k] along the first index i
(cf. Figure 7).
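To make this layout concrete, the following minimal sketch (not the authors' code) distributes a 3D array u[i][j][k] in row-wise blocks along the first index i and exchanges one overlapping plane with each neighbouring process in a single MPI_Sendrecv call. The grid size N, the variable names, and the assumption that N is divisible by the number of processes are chosen here only for illustration.

/* Sketch: row-wise block distribution of a 3D array along the first index i,
   stored contiguously, so that one overlapping plane can be exchanged with a
   neighbour in a single MPI call.  N and all names are illustrative only,
   and N is assumed to be divisible by the number of processes. */
#include <mpi.h>
#include <stdlib.h>

#define N 128                       /* global number of grid planes in i      */

int main(int argc, char **argv)
{
    int myid, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int nloc = N / nprocs;          /* planes owned by this process           */
    /* one extra "ghost" plane on each side for the overlap with neighbours   */
    double *u = calloc((size_t)(nloc + 2) * N * N, sizeof(double));

    int up    = (myid > 0)          ? myid - 1 : MPI_PROC_NULL;
    int down  = (myid < nprocs - 1) ? myid + 1 : MPI_PROC_NULL;
    int plane = N * N;              /* one i-plane is contiguous in memory    */

    /* send the last owned plane down and receive the upper ghost plane,
       then vice versa -- each direction is one message of contiguous data    */
    MPI_Sendrecv(u + nloc * plane,        plane, MPI_DOUBLE, down, 0,
                 u,                       plane, MPI_DOUBLE, up,   0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Sendrecv(u + plane,               plane, MPI_DOUBLE, up,   1,
                 u + (nloc + 1) * plane,  plane, MPI_DOUBLE, down, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    free(u);
    MPI_Finalize();
    return 0;
}

Because each i-plane occupies a contiguous block of memory, no derived datatypes or packing are needed; a whole overlapping plane travels in one send and one receive.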
In Figure 8 we can see the main structure of our parallel program. First, the
MPI environment is initialized, and every process involved in the parallel execution gets
its rank stored in the variable myid = 0, ..., nprocs − 1, where 0 represents the root
process and nprocs is the number of parallel processes. Then the root process reads
the time step τ and the upper estimate of the number of time steps nts.
These parameters are sent to all processes by the MPI_Bcast subroutine. In the
beginning of the algorithm we also read the image, compute the discrete values of
g T in function Coefficients0, and construct the initial segmentation function. All
these procedures work independently on their own (overlapping) subsets of data,
and no exchange of information between processes is necessary in this part of the
program. Then, in a cycle, we call the procedure EllipticStep, which in every time
step computes the coefficients (25) and solves the linear system (26). In the
iterative solution of the linear system we need to exchange the overlapping
data between neighbouring processes. The cycle is finished when condition (23)
is fulfilled, and finally the MPI environment is finalized.
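The following sketch (not the authors' actual code) illustrates this overall structure. The application-specific steps (reading the image, Coefficients0, EllipticStep) are indicated only by comments, the parameter values are placeholders, and the evaluation of stopping condition (23) via an MPI_Allreduce of a local flag is an assumption made purely for illustration.

/* Sketch of the main program structure described above (not the authors'
   code); application-specific work is indicated by comments only.         */
#include <mpi.h>

int main(int argc, char **argv)
{
    int myid, nprocs;
    MPI_Init(&argc, &argv);                 /* initialize the MPI environment */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);   /* myid = 0, ..., nprocs - 1      */
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double tau = 0.0;   /* time step, read by the root process               */
    int    nts = 0;     /* upper estimate of the number of time steps        */
    if (myid == 0) {
        /* the root process reads tau and nts, e.g. from a parameter file    */
        tau = 0.01;     /* placeholder values, for illustration only         */
        nts = 100;
    }
    /* send the parameters to all processes                                  */
    MPI_Bcast(&tau, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(&nts, 1, MPI_INT,    0, MPI_COMM_WORLD);

    /* each process reads its own (overlapping) block of the image, computes
       the discrete values of g (Coefficients0) and builds the initial
       segmentation function -- no communication is needed in this part      */

    int finished = 0;
    for (int step = 0; step < nts && !finished; step++) {
        /* EllipticStep: compute coefficients (25) and solve system (26);
           the iterative solver exchanges overlapping data between
           neighbouring processes in every iteration                         */

        /* check stopping condition (23); assuming all processes must agree,
           a local flag is combined with a global reduction                  */
        int local_stop = 0;  /* set to 1 when (23) holds locally             */
        MPI_Allreduce(&local_stop, &finished, 1, MPI_INT, MPI_LAND,
                      MPI_COMM_WORLD);
    }

    MPI_Finalize();                         /* close the MPI environment     */
    return 0;
}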
Figure 9 shows our distribution of data among the processes. Both the 3D
image and the discrete values of the segmentation function are represented by a
 