Because our implementation uses two arrays for each tissue equation, one for
the current and one for the new time steps, multithreading the solution of the tissue
equations is a straightforward use of an omp parallel for directive. The migration
and proliferation steps, however, require more than the simple addition of an omp
parallel directive. Both of these steps involve one thread examining and placing a
tumor cell in tissue volumes that another thread is responsible for, and proliferation
involves memory allocation when tumor cells divide. As a result, performing these
steps in multiple threads requires the use of omp critical directives to coordinate
the moves and allocation. Finally, we note that since both of these steps involve
counting the number of neighboring tissue volumes that contain cells, an additional
critical section is required to ensure proper counting; since we only need to test for null pointers when counting, however, we forgo the critical section and accept the additional stochasticity that improper counting introduces, rather than incur the additional overhead.
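The pattern is sketched below in OpenMP C. This is a minimal illustration, not the actual implementation: the grid size, the array names (u_cur, u_new, cells), the routines try_move and occupied_neighbors, and the trivial update rule are all assumed for the sake of the example.

#include <stddef.h>
#include <omp.h>

#define NX 64
#define NY 64
#define NZ 64
#define NVOL (NX * NY * NZ)

static double u_cur[NVOL];  /* tissue field, current time step */
static double u_new[NVOL];  /* tissue field, new time step */
static void  *cells[NVOL];  /* tumor cell pointers; NULL marks an empty volume */

/* Tissue equations: with separate current and new arrays, every volume
   can be updated independently, so a plain parallel for suffices. */
void tissue_step(void)
{
    #pragma omp parallel for
    for (int i = 0; i < NVOL; i++)
        u_new[i] = u_cur[i];  /* placeholder for the actual update rule */
}

/* Migration/proliferation: a thread may place a cell into a volume that
   another thread owns, so the move is serialized in a critical section. */
void try_move(int src, int dst)
{
    #pragma omp critical(cell_moves)
    {
        if (cells[dst] == NULL) {
            cells[dst] = cells[src];
            cells[src] = NULL;
        }
    }
}

/* Neighbor counting only tests pointers against NULL, so it is left
   unguarded; racy reads merely add stochasticity to the count. */
int occupied_neighbors(int i)
{
    int n = 0;
    if (i > 0        && cells[i - 1] != NULL) n++;  /* x-neighbors shown; */
    if (i < NVOL - 1 && cells[i + 1] != NULL) n++;  /* y and z are analogous */
    return n;
}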
4.4 Performance
To demonstrate the utility of our approach, we have investigated the performance
and scaling of our implementation. Because memory on each processor of Blue
Gene is limited, it is not possible to demonstrate strong scaling except on the
smallest of problems. Instead, we examine the weak scaling characteristics of our
implementation, using a test problem consisting of $64 \times 64 \times 64$ volumes on each processor, so the total problem size scales with the number of processors. Note that $64^3 = 2^{18}$ tissue volumes per processor on $4{,}096 = 2^{12}$ processors yields a total problem size of $2^{30} = O(10^9)$, which corresponds to the lower end of clinically relevant maximum tumor sizes. For simplicity and ease of comparison we restrict our calculations to square tissue on cube-shaped partitions ($P_x = P_y = P_z$).
The total problem size also scales with the number of generations that are
computed. Since different-sized tumors require different numbers of generations to be computed to generate the same number of tumor cells per processor, we focus instead on the per-generation time. This requires that we assign each processor the
same amount of work, independent of the number of processors. We have therefore
chosen to measure the timing of the first generation ($T = 1$ with $\Delta t = 0.001$, so that $N = 1{,}000$ time steps), using a random initial condition where each tissue
volume is assigned a tumor cell with probability one-half. As a result, our timings
approximate a worst-case scenario in terms of per processor work, but a best-case
scenario in terms of load balance.
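As a concrete illustration, the measurement might be set up as in the following sketch, where step() is a hypothetical stand-in for one time step of the coupled model and the RNG seed is arbitrary:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NVOL (64 * 64 * 64)   /* 2^18 tissue volumes per processor */

static char occupied[NVOL];

static void step(void) { /* placeholder for one time step of the model */ }

int main(void)
{
    const double dt = 0.001;                   /* Delta t */
    const int nsteps = (int)(1.0 / dt + 0.5);  /* N = 1,000 steps per generation */

    /* Random initial condition: each volume holds a tumor cell with
       probability one-half, so expected work is the same on every processor. */
    srand(12345);
    for (int i = 0; i < NVOL; i++)
        occupied[i] = (char)(rand() & 1);

    double t0 = omp_get_wtime();
    for (int n = 0; n < nsteps; n++)
        step();
    printf("first-generation time: %.3f s\n", omp_get_wtime() - t0);
    return 0;
}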
The per generation timings on Blue Gene/L are shown for varying numbers of
processors (32, 64, 128, 512, 1,024, 2,048, 4,096 and 8,192) in Figure 4.2. The
parallel tumor invasion implementation demonstrates almost perfect weak scaling
to one rack (1,024 processors) and reasonable weak scaling to eight racks (8,192
processors). The slightly reduced scaling across multiple racks is to be expected, given the reduced communication bandwidth between racks, but the overall per-processor, per-generation performance remains quite good: only a 24% reduction in scaling from 32 or 64 processors to 8,192 processors.
As described above, to keep the total problem size identical in our weak scaling runs, we used a random initial condition that yielded a worst-case scenario in terms of per-processor work.