Digital Signal Processing Reference
In-Depth Information
computation of disparity estimation and optical flow is proposed and implemented
on the GPU. A holistic architecture for phase-based disparity estimation, optical
architecture for disparity estimation and motion estimation based on SAD is
3.6 Implementation Example: Semi-global Matching on the GPU
An example implementation of the semi-global matching algorithm for GPUs will
be presented. As the GPU architecture and its terminology are by now
common, an introduction of the architecture and the terminology will be skipped.
The target device in the
following is an Nvidia Tesla C2050 with compute capability 2.0, providing 3 GB of
GDDR5 global memory with a maximum theoretical bandwidth of 144 GB/s.
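As a sanity check, the quoted figure follows directly from the memory interface parameters. The sketch below assumes the C2050's 384-bit memory bus and an effective GDDR5 data rate of 3.0 GT/s per pin (taken from the public board specification, not from this text):

```python
# Theoretical peak memory bandwidth of an Nvidia Tesla C2050.
# Assumed specs: 384-bit memory interface, GDDR5 at an effective 3.0 GT/s per pin.
BUS_WIDTH_BITS = 384
EFFECTIVE_RATE_GTS = 3.0  # giga-transfers per second per pin


def peak_bandwidth_gb_s(bus_width_bits: int, rate_gts: float) -> float:
    """Bytes moved per second = (bus width in bytes) * (transfers per second)."""
    return bus_width_bits / 8 * rate_gts


print(peak_bandwidth_gb_s(BUS_WIDTH_BITS, EFFECTIVE_RATE_GTS))  # → 144.0
```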
3.6.1 Parallelization Principles
- Effective memory bandwidth usage for the payload data, which is e.g. reduced by non-aligned, overhead-producing memory access
- Instruction throughput, defined as the number of instructions performing arithmetics for the core computation and other non-ancillary instructions per unit of time
- Latency of the memory interface, occurring e.g. when accessing scattered memory locations, even if aligned, coalesced, warp-wise access is performed
- Latency of the arithmetic pipeline of the ALUs inside the GPU cores, if arithmetic instructions depend on each other and can only be executed with the result from the previous instruction
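The last point can be made concrete with a toy pipeline model (a sketch, not a hardware simulation; the pipeline depth of 18 cycles is an assumed, Fermi-like value): with an arithmetic pipeline of depth L, a chain of N mutually dependent instructions costs roughly N·L cycles, while N independent instructions can be issued back to back and drain in about N + L − 1 cycles.

```python
# Toy model of arithmetic pipeline latency.
PIPELINE_DEPTH = 18  # assumed latency in cycles of one arithmetic instruction


def cycles_dependent(n_instr: int, depth: int = PIPELINE_DEPTH) -> int:
    # Each instruction must wait for the result of the previous one.
    return n_instr * depth


def cycles_independent(n_instr: int, depth: int = PIPELINE_DEPTH) -> int:
    # One instruction is issued per cycle; the last one drains the pipeline.
    return n_instr + depth - 1


print(cycles_dependent(100))    # → 1800
print(cycles_independent(100))  # → 117
```

Splitting a long dependency chain into several independent accumulators that are summed at the end is the usual way to move a computation from the first case towards the second.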
Accordingly, kernels can be memory bound, compute bound or latency bound.
Kernels that are not limited by any of the three bounds are ill-adapted for GPU
implementation and can be classified as bound by their parallelization scheme.
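Whether a kernel is memory or compute bound can be estimated by comparing its arithmetic intensity (operations per byte of global memory traffic) with the machine balance of the device. The sketch below combines the 144 GB/s from above with an assumed single-precision peak of roughly 1030 GFLOP/s for the C2050 (448 cores × 1.15 GHz × 2 operations per cycle; these peak numbers are an assumption, not from this text):

```python
# Rough roofline-style classification of a kernel.
PEAK_GFLOPS = 448 * 1.15 * 2  # ~1030 single-precision GFLOP/s (assumed)
PEAK_GBS = 144.0              # peak memory bandwidth of the Tesla C2050


def classify(ops: float, bytes_moved: float) -> str:
    """Classify a kernel by its arithmetic intensity in operations per byte."""
    intensity = ops / bytes_moved
    balance = PEAK_GFLOPS / PEAK_GBS  # machine balance, ~7.2 ops per byte
    return "compute bound" if intensity > balance else "memory bound"


# A SAD-style matching step does only a few additions per byte read,
# so its arithmetic intensity is far below the machine balance.
print(classify(ops=3, bytes_moved=2))  # → memory bound
```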
An efficient parallelization scheme guarantees inherently aligned and coalesced
data access schemes without instruction overhead. Coalesced memory access is
the simultaneous memory access to consecutive memory locations by all threads
of a warp. It further includes a combination of parallel and sequential processing
with independent arithmetic computation steps. An inner (sequential) loop in the
otherwise parallel threads working on a set of data that is kept in shared memory or