Biomedical Engineering Reference
Figure 8.16 (a) Scalability and (b) speedup for increasing numbers of nodes running the most
work-intensive image from the mammary set.
than GPU-assisted executions, they are also more scalable on a larger number of
nodes due to the communication bindings of paired CPU-GPU executions. This is
confirmed in Figure 8.14, where the most work-intensive mammary image is tested
for assorted combinations of CPUs and GPUs.
For increasing numbers of nodes, Figures 8.15 and 8.16 show a progressive
reduction in execution times. For the most work-intensive mammary image, the
speedup on 16 versus 2 nodes with a 1-CPU configuration is 7.4x, whereas for a
2 CPU/GPU per node configuration the speedup is slightly over 4x. The less effective
internode parallelism of the more aggressive configurations is due in large part to
their more demanding intranode communications.
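
The two speedups quoted above can be put on a common footing as parallel
efficiency: going from 2 to 16 nodes allows at most an 8x gain, so the ratio of
measured to ideal speedup shows how much of that gain each configuration
realizes. The sketch below is only illustrative. The function name
parallel_efficiency is hypothetical, and 4.0 is used as a stand-in lower bound
for "slightly over 4x".

```python
def parallel_efficiency(speedup, base_nodes, scaled_nodes):
    """Return measured speedup as a fraction of the ideal (linear) speedup."""
    if base_nodes <= 0 or scaled_nodes <= 0:
        raise ValueError("node counts must be positive")
    ideal_speedup = scaled_nodes / base_nodes  # 16 / 2 = 8 for the case in the text
    return speedup / ideal_speedup


# Figures quoted in the text for the most work-intensive mammary image.
cpu_only = parallel_efficiency(7.4, base_nodes=2, scaled_nodes=16)
# "slightly over 4x" is approximated by 4.0 (an assumption, not a measured value).
cpu_gpu = parallel_efficiency(4.0, base_nodes=2, scaled_nodes=16)

print(f"1 CPU per node:     {cpu_only:.1%} of ideal")
print(f"2 CPU/GPU per node: {cpu_gpu:.1%} of ideal")
```

Under these assumptions the 1-CPU configuration retains a little over 90% of
the ideal 8x gain, while the 2 CPU/GPU configuration retains roughly half.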
8.7 Summary
The next generation of automated microscope imaging applications, such as
quantitative phenotyping, requires the analysis of extremely large datasets,
making scalability and parallelization of algorithms essential.
This chapter presents a fast, scalable, and simply parallelizable algorithm for
image registration that is capable of correcting the nonrigid distortions of
sectioned microscope images. Rigid initialization follows a simply reasoned
process of matching high-level features that are quickly and easily extracted
through standard image processing techniques. Nonrigid registration refines the
result of rigid initialization, using its estimates to match intensity features
with an FFT implementation of normalized cross-correlation.
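
As a rough illustration of FFT-based normalized cross-correlation, the
following NumPy sketch scores every placement of a small template inside a
larger search window. It is not the chapter's implementation: the function
names, the restriction to fully overlapping placements, and the synthetic test
data are all assumptions made for the example.

```python
import numpy as np


def _correlate_valid(image, kernel):
    """Cross-correlate via FFT, keeping only placements fully inside the image."""
    H, W = image.shape
    h, w = kernel.shape
    # Correlation theorem: corr(f, g) <-> F * conj(G). The kernel is zero-padded
    # to the image size; wrap-around only affects placements cropped away below.
    spectrum = np.fft.rfft2(image) * np.conj(np.fft.rfft2(kernel, s=(H, W)))
    full = np.fft.irfft2(spectrum, s=(H, W))
    return full[: H - h + 1, : W - w + 1]


def ncc_fft(image, template, eps=1e-12):
    """Normalized cross-correlation map of template over image (hypothetical helper)."""
    image = np.asarray(image, dtype=float)
    template = np.asarray(template, dtype=float)
    n = template.size

    # Numerator: correlating with the zero-mean template removes the need to
    # subtract the local image mean, because the template then sums to zero.
    t0 = template - template.mean()
    numerator = _correlate_valid(image, t0)

    # Denominator: local sums of f and f^2 under the template footprint.
    ones = np.ones_like(template)
    local_sum = _correlate_valid(image, ones)
    local_sq = _correlate_valid(image ** 2, ones)
    local_var = np.maximum(local_sq - local_sum ** 2 / n, 0.0)
    denom = np.sqrt(local_var * np.sum(t0 ** 2))

    # Flat regions have zero variance; their score is left at 0.
    ncc = np.zeros_like(numerator)
    np.divide(numerator, denom, out=ncc, where=denom > eps)
    return ncc


# Synthetic check: cut a template out of a random image and locate it again.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
row, col = 20, 33
template = image[row:row + 8, col:col + 8]

scores = ncc_fft(image, template)
peak = np.unravel_index(np.argmax(scores), scores.shape)
print("best match at", tuple(int(i) for i in peak))
```

For the synthetic data, the peak should fall at the position the template was
cut from, with a score close to 1. Each template match is independent of the
others, which is what makes this step easy to divide among CPUs, GPUs, or
nodes.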
A computational framework for the two-stage algorithm is also provided, along
with results from sample high-performance implementations. Two hardware-based
solutions are presented for nonrigid feature matching: parallel systems and graphics
processor acceleration. Scalability is demonstrated both on single-node systems,
where GPUs and CPUs cooperate, and on multiple-node systems, where any
variety of the single-node configurations can divide the work. From a departure
point of 181 hours to run 500 mammary images on a single Opteron CPU, the
GPU-accelerated parallel implementation is able to reduce this time to 3.7 hours
 