tasks (e.g., disparity estimation with optical flow). In [28] an algorithm for the joint computation of disparity estimation and optical flow is proposed and implemented on the GPU. A holistic architecture for phase-based disparity estimation, optical flow, and more is presented in [85] and implemented on an FPGA. A holistic architecture for disparity estimation and motion estimation based on SAD is presented in [102].
3.6 Implementation Example: Semi-global Matching on the GPU
An example implementation of the semi-global matching algorithm for GPUs is given here, based on the work in [4]. Since GPUs have become increasingly common, an introduction to the architecture and terminology is skipped; please refer to the Nvidia manuals and [35] for a detailed background on GPU architecture, or directly to [4] for a short sketch. The evaluation platform in the following is an Nvidia Tesla C2050 with compute capability 2.0, providing 3 GB of DDR RAM global memory with a maximum theoretical bandwidth of 144 GB/s.
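For reference, the quantity such an implementation has to evaluate for every pixel and disparity is the well-known SGM path cost recursion of Hirschmüller (notation here is generic, not necessarily the chapter's own): for a path direction $\mathbf{r}$, matching cost $C(\mathbf{p},d)$, and smoothness penalties $P_1 \le P_2$,

$$
L_{\mathbf{r}}(\mathbf{p},d) = C(\mathbf{p},d) + \min\Bigl( L_{\mathbf{r}}(\mathbf{p}-\mathbf{r},d),\; L_{\mathbf{r}}(\mathbf{p}-\mathbf{r},d-1)+P_1,\; L_{\mathbf{r}}(\mathbf{p}-\mathbf{r},d+1)+P_1,\; \min_{k} L_{\mathbf{r}}(\mathbf{p}-\mathbf{r},k)+P_2 \Bigr) - \min_{k} L_{\mathbf{r}}(\mathbf{p}-\mathbf{r},k)
$$

with the final aggregated cost $S(\mathbf{p},d)=\sum_{\mathbf{r}} L_{\mathbf{r}}(\mathbf{p},d)$. The subtracted term only bounds the growth of $L_{\mathbf{r}}$ and does not change the minimizing disparity.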
3.6.1 Parallelization Principles
Banz et al. [4] formulate the following performance-limiting factors for a kernel:

• Effective memory bandwidth usage for the payload data, which is reduced, e.g., by non-aligned, overhead-producing memory access
• Instruction throughput, defined as the number of instructions performing arithmetic for the core computation and other non-ancillary instructions per unit of time
• Latency of the memory interface, occurring e.g. when accessing scattered memory locations, even if aligned, coalesced, warp-wise access is performed
• Latency of the arithmetic pipeline of the ALUs inside the GPU cores, if arithmetic instructions depend on each other and can only be executed with the result from the previous instruction (see the sketch after this list)
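The last point can be illustrated with a minimal CUDA sketch (hypothetical kernels, not taken from [4]): in the first kernel every addition forms a serial dependence chain and must wait for the previous result to leave the pipeline, while the second uses four independent accumulators whose additions the scheduler can overlap.

// Hypothetical illustration of arithmetic pipeline latency vs.
// instruction-level parallelism; launch each with <<<1, 1>>>.
__global__ void sum_dependent(const float *in, float *out, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += in[i];                  // each add depends on the previous one
    *out = acc;
}

__global__ void sum_independent(const float *in, float *out, int n)
{
    // Assumes n is a multiple of 4 for brevity.
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (int i = 0; i < n; i += 4) {   // four independent dependence chains
        a0 += in[i];
        a1 += in[i + 1];
        a2 += in[i + 2];
        a3 += in[i + 3];
    }
    *out = a0 + a1 + a2 + a3;
}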
Accordingly, kernels can be memory bound, compute bound, or latency bound. Kernels that are not limited by any of these three are ill-adapted for GPU implementation and can be classified as bound by their parallelization scheme. An efficient parallelization scheme guarantees inherently aligned and coalesced data access without instruction overhead; coalesced memory access is the simultaneous access of all threads of a warp to consecutive memory locations. Such a scheme further combines parallel and sequential processing with independent arithmetic computation steps, for example an inner (sequential) loop in the otherwise parallel threads that works on a set of data kept in shared memory or registers.
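A minimal sketch of exactly this pattern, under stated assumptions (it is not the actual kernel from [4]; DMAX, P1, P2 and the row-major [y][x][d] cost-volume layout are illustrative choices): one block aggregates one horizontal SGM path, threads parallelize over disparities so that the warp-wise read of the cost volume is coalesced, and the loop over pixels along the path is the sequential inner loop, with the previous pixel's path costs kept in shared memory.

// Hypothetical SGM path aggregation kernel; launch with
// aggregate_path<<<height, DMAX>>>(d_C, d_L, width);
#define DMAX 64    // disparity range = block size (assumption)
#define P1   7     // illustrative penalty values
#define P2   100

__global__ void aggregate_path(const unsigned char *C,  // cost volume [y][x][d]
                               unsigned short *L,        // path costs, same layout
                               int width)
{
    __shared__ unsigned short Lprev[DMAX];  // previous pixel's path costs
    int d = threadIdx.x;                    // one thread per disparity
    int y = blockIdx.x;                     // one block per image row

    // First pixel of the path: L(x = 0, d) = C(x = 0, d).
    Lprev[d] = C[(y * width) * DMAX + d];
    L[(y * width) * DMAX + d] = Lprev[d];
    __syncthreads();

    for (int x = 1; x < width; ++x) {       // sequential inner loop along the path
        // Minimum over all disparities of the previous pixel; computed
        // redundantly per thread here, a parallel reduction in practice.
        int minPrev = Lprev[0];
        for (int k = 1; k < DMAX; ++k)
            minPrev = min(minPrev, (int)Lprev[k]);

        // SGM recursion; the read of C is coalesced across the warp.
        int cost = C[(y * width + x) * DMAX + d];
        int best = Lprev[d];
        if (d > 0)        best = min(best, Lprev[d - 1] + P1);
        if (d < DMAX - 1) best = min(best, Lprev[d + 1] + P1);
        best = min(best, minPrev + P2);

        int Lcur = cost + best - minPrev;
        __syncthreads();                    // all reads of Lprev are done
        Lprev[d] = (unsigned short)Lcur;
        L[(y * width + x) * DMAX + d] = (unsigned short)Lcur;
        __syncthreads();                    // all writes visible before next pixel
    }
}

The two barriers per pixel are the price of keeping the recursion's data in shared memory: each thread reads its neighbors' previous path costs, so writes for pixel x must not begin before all reads for pixel x have finished.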
 
 