(a cost which can later be amortized by loading the plan from disk at runtime on
subsequent transforms of the same size).
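If, for instance, the FFTW library is the one in use, its "wisdom" mechanism is one common way to persist plans to disk; the following is a minimal sketch only, with the file name and transform sizes chosen for illustration rather than taken from the chapter.

#include <fftw3.h>

/* Create a 2-D plan once, paying the planning cost up front, and save the
 * accumulated "wisdom" to disk so that later runs on transforms of the same
 * size can reload it instead of re-planning. File name is a placeholder. */
void plan_with_cached_wisdom(int nx, int ny)
{
    /* Returns 0 if no wisdom file exists yet; that is fine on the first run. */
    fftw_import_wisdom_from_filename("fft_wisdom.dat");

    fftw_complex *buf = fftw_alloc_complex((size_t)nx * ny);
    fftw_plan p = fftw_plan_dft_2d(ny, nx, buf, buf,
                                   FFTW_FORWARD, FFTW_MEASURE);

    fftw_export_wisdom_to_filename("fft_wisdom.dat");

    /* ... transforms of this size now run via fftw_execute(p) ... */

    fftw_destroy_plan(p);
    fftw_free(buf);
}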
8.4.3 GPU Acceleration
To further speed up the demanding process of intensity feature matching, a GPU can
be employed to perform the FFT operations used in the normalized cross-correlation
calculation: by the correlation theorem, the cross-correlation of two image tiles is
the inverse FFT of the product of one tile's spectrum with the complex conjugate of
the other's. The FFT is a good candidate for GPU acceleration since it is a highly
parallel "divide and conquer" algorithm. In general, the performance of algorithms
on GPUs depends on how well parallelism, memory locality, bus bandwidth, and raw
floating-point throughput (GFLOPs) are exploited.
General Purpose GPU: CUDA
GPU resources are a specialized set of registers and instructions intended specifically
for common graphics operations. Programming models exist to access these resources
for more general-purpose tasks. One such interface, the Compute Unified Device
Architecture (CUDA) [47], consists of a set of library functions that can be coded as
an extension of the C language. CUDA has an associated compiler that generates code
acceptable to the GPU, which is then seen by the CPU as a multicore processor. The
CUDA hardware interface, shown in Figure 8.7, hides the notions of the graphics
pipeline (see Figure 8.6) and instead presents the GPU resources as a collection of
threads running in parallel.
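As a generic illustration of this model (not code from the chapter), a CUDA kernel is an ordinary C function marked __global__; the host launches it over a grid of thread blocks, and each thread computes its own index and handles one element.

#include <cuda_runtime.h>

/* Each thread processes one array element; the hardware schedules the
 * threads across the GPU's stream processors. */
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

/* Host side: 256 threads per block, enough blocks to cover n elements.
 * d_x and d_y are assumed to be pointers to device memory. */
void launch_saxpy(int n, float a, float *d_x, float *d_y)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, a, d_x, d_y);
    cudaDeviceSynchronize();
}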
FFT on GPU
To accelerate intensity feature matching, the forward FFTs, the pointwise spectrum
multiplication, and the inverse FFT are implemented on the GPU. In the single-node
implementation, parallelism with two GPUs is achieved either by having one CPU
extract features sequentially and cycle them between the two GPUs, or by having two
CPUs divide the work, each with its own GPU. The multiple-node implementation
does not process features sequentially, and GPU parallelism is achieved by assigning
one GPU to each CPU socket. In this case the GPU simply takes the place of the CPU
in calculating the normalized cross-correlations.
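The sketch below shows how this forward-FFT / pointwise-multiplication / inverse-FFT sequence might be expressed with NVIDIA's cuFFT library; the buffer names are assumptions for illustration, and the normalization that turns the correlation surface into a normalized cross-correlation is omitted.

#include <cufft.h>
#include <cuComplex.h>

/* C[i] = A[i] * conj(B[i]): the frequency-domain core of cross-correlation. */
__global__ void conjMul(const cufftComplex *A, const cufftComplex *B,
                        cufftComplex *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = cuCmulf(A[i], cuConjf(B[i]));
}

/* Cross-correlate two nx-by-ny tiles already resident in GPU memory.
 * d_a and d_b are overwritten with their spectra; the (unnormalized)
 * correlation surface is returned in d_c. */
void gpuCrossCorrelate(cufftComplex *d_a, cufftComplex *d_b,
                       cufftComplex *d_c, int nx, int ny)
{
    cufftHandle plan;
    cufftPlan2d(&plan, ny, nx, CUFFT_C2C);        /* ny rows, nx columns */

    cufftExecC2C(plan, d_a, d_a, CUFFT_FORWARD);  /* forward FFT of tile a */
    cufftExecC2C(plan, d_b, d_b, CUFFT_FORWARD);  /* forward FFT of tile b */

    int n = nx * ny;
    conjMul<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cufftExecC2C(plan, d_c, d_c, CUFFT_INVERSE);  /* back to the spatial domain */
    cufftDestroy(plan);
}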
Figure 8.6 The Nvidia G80 architecture, used in the implementation of intensity feature matching.
Programs are decomposed into threads that are executed on the 128 stream processors, located in
the central row. Data are stored in the L1 and L2 caches, and the video memory is located in the
lower rows.