(a cost which can later be amortized by loading the plan from disk at runtime on
subsequent transforms of the same size).
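If, for instance, the FFTW library is the one in use, its "wisdom" mechanism is one common way to persist plans to disk; the following is a minimal sketch only, with the file name and transform sizes chosen for illustration rather than taken from the chapter.

#include <fftw3.h>

/* Create a 2-D plan once, paying the planning cost up front, and save the
 * accumulated "wisdom" to disk so that later runs on transforms of the same
 * size can reload it instead of re-planning. File name is a placeholder. */
void plan_with_cached_wisdom(int nx, int ny)
{
    /* Returns 0 if no wisdom file exists yet; that is fine on the first run. */
    fftw_import_wisdom_from_filename("fft_wisdom.dat");

    fftw_complex *buf = fftw_alloc_complex((size_t)nx * ny);
    fftw_plan p = fftw_plan_dft_2d(ny, nx, buf, buf,
                                   FFTW_FORWARD, FFTW_MEASURE);

    fftw_export_wisdom_to_filename("fft_wisdom.dat");

    /* ... transforms of this size now run via fftw_execute(p) ... */

    fftw_destroy_plan(p);
    fftw_free(buf);
}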
8.4.3 GPU Acceleration
To further speed up the demanding process of intensity feature matching, a GPU can
be employed to perform the FFT operations used in the normalized cross-correlation
calculation: by the correlation theorem, the cross-correlation of two image tiles is
the inverse FFT of the product of one tile's spectrum with the complex conjugate of
the other's. The FFT is a good candidate for GPU acceleration since it is a highly
parallel "divide and conquer" algorithm. In general, the performance of algorithms
on GPUs depends on how well parallelism, memory locality, bus bandwidth, and raw
floating-point throughput (GFLOPs) are exploited.
General Purpose GPU: CUDA
GPU resources are a specialized set of registers and instructions intended specifically
for common graphics operations. Programming models exist to access these resources
for more general-purpose tasks. One such interface, the Compute Unified Device
Architecture (CUDA) [47], consists of a set of library functions that can be coded as
an extension of the C language. CUDA has an associated compiler that generates code
acceptable to the GPU, which is then seen by the CPU as a multicore processor. The
CUDA hardware interface, shown in Figure 8.7, hides the notions of the graphics
pipeline (see Figure 8.6) and instead presents the GPU resources as a collection of
threads running in parallel.
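As a generic illustration of this model (not code from the chapter), a CUDA kernel is an ordinary C function marked __global__; the host launches it over a grid of thread blocks, and each thread computes its own index and handles one element.

#include <cuda_runtime.h>

/* Each thread processes one array element; the hardware schedules the
 * threads across the GPU's stream processors. */
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

/* Host side: 256 threads per block, enough blocks to cover n elements.
 * d_x and d_y are assumed to be pointers to device memory. */
void launch_saxpy(int n, float a, float *d_x, float *d_y)
{
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    saxpy<<<blocks, threads>>>(n, a, d_x, d_y);
    cudaDeviceSynchronize();
}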
FFT on GPU
To accelerate intensity feature matching, the forward FFTs, the pointwise spectrum
multiplication, and the inverse FFT are implemented on the GPU. In the single-node
implementation, parallelism with two GPUs is achieved either by having one CPU
extract features sequentially and cycle them between the two GPUs, or by having two
CPUs divide the work, each with its own GPU. The multiple-node implementation
does not process features sequentially, and GPU parallelism is achieved by assigning
one GPU to each CPU socket. In this case the GPU simply takes the place of the CPU
in calculating the normalized cross-correlations.
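The sketch below shows how this forward-FFT / pointwise-multiplication / inverse-FFT sequence might be expressed with NVIDIA's cuFFT library; the buffer names are assumptions for illustration, and the normalization that turns the correlation surface into a normalized cross-correlation is omitted.

#include <cufft.h>
#include <cuComplex.h>

/* C[i] = A[i] * conj(B[i]): the frequency-domain core of cross-correlation. */
__global__ void conjMul(const cufftComplex *A, const cufftComplex *B,
                        cufftComplex *C, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        C[i] = cuCmulf(A[i], cuConjf(B[i]));
}

/* Cross-correlate two nx-by-ny tiles already resident in GPU memory.
 * d_a and d_b are overwritten with their spectra; the (unnormalized)
 * correlation surface is returned in d_c. */
void gpuCrossCorrelate(cufftComplex *d_a, cufftComplex *d_b,
                       cufftComplex *d_c, int nx, int ny)
{
    cufftHandle plan;
    cufftPlan2d(&plan, ny, nx, CUFFT_C2C);        /* ny rows, nx columns */

    cufftExecC2C(plan, d_a, d_a, CUFFT_FORWARD);  /* forward FFT of tile a */
    cufftExecC2C(plan, d_b, d_b, CUFFT_FORWARD);  /* forward FFT of tile b */

    int n = nx * ny;
    conjMul<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    cufftExecC2C(plan, d_c, d_c, CUFFT_INVERSE);  /* back to the spatial domain */
    cufftDestroy(plan);
}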
Figure 8.6 The Nvidia G80 architecture, used in the implementation of intensity feature matching.
Programs are decomposed into threads that are executed on the 128 stream processors, located in
the central row. Data are stored in the L1 and L2 caches, and the video memory is located in the
lower rows.