Because our implementation uses two arrays for each tissue equation, one for
the current and one for the new time steps, multithreading the solution of the tissue
equations is a straightforward use of an omp parallel for directive. The migration
and proliferation steps, however, require more than the simple addition of an omp
parallel directive. Both of these steps involve one thread examining and placing a
tumor cell in tissue volumes that another thread is responsible for, and proliferation
involves memory allocation when tumor cells divide. As a result, performing these
steps in multiple threads requires the use of omp critical directives to coordinate
the moves and allocation. Finally, we note that since both of these steps involve
counting the number of neighboring tissue volumes that contain cells, an additional
critical section is required to ensure proper counting; since we only need to test for null pointers when counting, however, we forgo the critical section and accept the additional stochasticity that improper counting introduces, rather than incur the additional overhead.
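The pattern is sketched below in OpenMP C. This is a minimal illustration, not the actual implementation: the grid size, the array names (u_cur, u_new, cells), the routines try_move and occupied_neighbors, and the trivial update rule are all assumed for the sake of the example.

#include <stddef.h>
#include <omp.h>

#define NX 64
#define NY 64
#define NZ 64
#define NVOL (NX * NY * NZ)

static double u_cur[NVOL];  /* tissue field, current time step */
static double u_new[NVOL];  /* tissue field, new time step */
static void  *cells[NVOL];  /* tumor cell pointers; NULL marks an empty volume */

/* Tissue equations: with separate current and new arrays, every volume
   can be updated independently, so a plain parallel for suffices. */
void tissue_step(void)
{
    #pragma omp parallel for
    for (int i = 0; i < NVOL; i++)
        u_new[i] = u_cur[i];  /* placeholder for the actual update rule */
}

/* Migration/proliferation: a thread may place a cell into a volume that
   another thread owns, so the move is serialized in a critical section. */
void try_move(int src, int dst)
{
    #pragma omp critical(cell_moves)
    {
        if (cells[dst] == NULL) {
            cells[dst] = cells[src];
            cells[src] = NULL;
        }
    }
}

/* Neighbor counting only tests pointers against NULL, so it is left
   unguarded; racy reads merely add stochasticity to the count. */
int occupied_neighbors(int i)
{
    int n = 0;
    if (i > 0        && cells[i - 1] != NULL) n++;  /* x-neighbors shown; */
    if (i < NVOL - 1 && cells[i + 1] != NULL) n++;  /* y and z are analogous */
    return n;
}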
4.4 Performance
To demonstrate the utility of our approach, we have investigated the performance
and scaling of our implementation. Because memory on each processor of Blue
Gene is limited, it is not possible to demonstrate strong scaling except on the
smallest of problems. Instead, we examine the weak scaling characteristics of our
implementation, using a test problem consisting of $64 \times 64 \times 64$ volumes on each processor, so the total problem size scales with the number of processors. Note that $64^3 = 2^{18}$ tissue volumes per processor on $4{,}096 = 2^{12}$ processors yields a total problem size of $2^{30} = O(10^9)$, which corresponds to the lower end of clinically relevant maximum tumor sizes. For simplicity and ease of comparison we restrict our calculations to square tissue on cube-shaped partitions ($P_x = P_y = P_z$).
The total problem size also scales with the number of generations that are
computed. Since different-sized tumors require different numbers of generations to be computed to generate the same number of tumor cells per processor, we focus instead on the per-generation time. This requires that we assign each processor the
same amount of work, independent of the number of processors. We have therefore
chosen to measure the timing of the first generation ($T = 1$ with $\Delta t = 0.001$, so that $N = 1{,}000$ time steps), using a random initial condition where each tissue
volume is assigned a tumor cell with probability one-half. As a result, our timings
approximate a worst-case scenario in terms of per processor work, but a best-case
scenario in terms of load balance.
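As a concrete illustration, the measurement might be set up as in the following sketch, where step() is a hypothetical stand-in for one time step of the coupled model and the RNG seed is arbitrary:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NVOL (64 * 64 * 64)   /* 2^18 tissue volumes per processor */

static char occupied[NVOL];

static void step(void) { /* placeholder for one time step of the model */ }

int main(void)
{
    const double dt = 0.001;                   /* Delta t */
    const int nsteps = (int)(1.0 / dt + 0.5);  /* N = 1,000 steps per generation */

    /* Random initial condition: each volume holds a tumor cell with
       probability one-half, so expected work is the same on every processor. */
    srand(12345);
    for (int i = 0; i < NVOL; i++)
        occupied[i] = (char)(rand() & 1);

    double t0 = omp_get_wtime();
    for (int n = 0; n < nsteps; n++)
        step();
    printf("first-generation time: %.3f s\n", omp_get_wtime() - t0);
    return 0;
}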
The per generation timings on Blue Gene/L are shown for varying numbers of
processors (32, 64, 128, 512, 1,024, 2,048, 4,096 and 8,192) in Figure 4.2. The
parallel tumor invasion implementation demonstrates almost perfect weak scaling
to one rack (1,024 processors) and reasonable weak scaling to eight racks (8,192
processors). The slightly reduced scaling across multiple racks is to be expected, given the reduced communication bandwidth between racks, but the overall per-processor, per-generation performance remains quite good: only a 24% reduction in scaling from 32 or 64 processors to 8,192 processors.
As described above, to keep the total problem size identical in our weak scaling runs, we used a random initial condition that yielded a worst-case scenario in terms of per-processor work.