tasks (e.g., disparity estimation with optical flow). In [28] an algorithm for the joint computation of disparity estimation and optical flow is proposed and implemented on the GPU. A holistic architecture for phase-based disparity estimation, optical flow, and more is presented in [85] and implemented on an FPGA. A holistic architecture for disparity estimation and motion estimation based on SAD is presented in [102].
3.6 Implementation Example: Semi-global Matching on the GPU
An example implementation of the semi-global matching algorithm for GPUs is given here, based on the work in [4]. Since GPUs have become increasingly common, an introduction to the architecture and terminology is skipped; please refer to the Nvidia manuals and [35] for a detailed background on GPU architecture, or directly to [4] for a short sketch. The evaluation platform in the following is an Nvidia Tesla C2050 with compute capability 2.0, providing 3 GB of DDR RAM global memory with a maximum theoretical bandwidth of 144 GB/s.
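For reference, the quantity such an implementation has to evaluate for every pixel and disparity is the well-known SGM path cost recursion of Hirschmüller (notation here is generic, not necessarily the chapter's own): for a path direction $\mathbf{r}$, matching cost $C(\mathbf{p},d)$, and smoothness penalties $P_1 \le P_2$,

$$
L_{\mathbf{r}}(\mathbf{p},d) = C(\mathbf{p},d) + \min\Bigl( L_{\mathbf{r}}(\mathbf{p}-\mathbf{r},d),\; L_{\mathbf{r}}(\mathbf{p}-\mathbf{r},d-1)+P_1,\; L_{\mathbf{r}}(\mathbf{p}-\mathbf{r},d+1)+P_1,\; \min_{k} L_{\mathbf{r}}(\mathbf{p}-\mathbf{r},k)+P_2 \Bigr) - \min_{k} L_{\mathbf{r}}(\mathbf{p}-\mathbf{r},k)
$$

with the final aggregated cost $S(\mathbf{p},d)=\sum_{\mathbf{r}} L_{\mathbf{r}}(\mathbf{p},d)$. The subtracted term only bounds the growth of $L_{\mathbf{r}}$ and does not change the minimizing disparity.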
3.6.1 Parallelization Principles
Banz et al. [4] formulate the following performance-limiting factors for a kernel:

• Effective memory bandwidth usage for the payload data, which is reduced, e.g., by non-aligned, overhead-producing memory access
• Instruction throughput, defined as the number of instructions performing arithmetic for the core computation and other non-ancillary instructions per unit of time
• Latency of the memory interface, occurring e.g. when accessing scattered memory locations, even if aligned, coalesced, warp-wise access is performed
• Latency of the arithmetic pipeline of the ALUs inside the GPU cores, if arithmetic instructions depend on each other and can only be executed with the result from the previous instruction (see the sketch after this list)
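The last point can be illustrated with a minimal CUDA sketch (hypothetical kernels, not taken from [4]): in the first kernel every addition forms a serial dependence chain and must wait for the previous result to leave the pipeline, while the second uses four independent accumulators whose additions the scheduler can overlap.

// Hypothetical illustration of arithmetic pipeline latency vs.
// instruction-level parallelism; launch each with <<<1, 1>>>.
__global__ void sum_dependent(const float *in, float *out, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += in[i];                  // each add depends on the previous one
    *out = acc;
}

__global__ void sum_independent(const float *in, float *out, int n)
{
    // Assumes n is a multiple of 4 for brevity.
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    for (int i = 0; i < n; i += 4) {   // four independent dependence chains
        a0 += in[i];
        a1 += in[i + 1];
        a2 += in[i + 2];
        a3 += in[i + 3];
    }
    *out = a0 + a1 + a2 + a3;
}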
Accordingly, kernels can be memory bound, compute bound, or latency bound. Kernels that are not limited by any of these three are ill-adapted for GPU implementation and can be classified as bound by their parallelization scheme. An efficient parallelization scheme guarantees inherently aligned and coalesced data access without instruction overhead; coalesced memory access is the simultaneous access of all threads of a warp to consecutive memory locations. Such a scheme further combines parallel and sequential processing with independent arithmetic computation steps, for example an inner (sequential) loop in the otherwise parallel threads that works on a set of data kept in shared memory or registers.
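A minimal sketch of exactly this pattern, under stated assumptions (it is not the actual kernel from [4]; DMAX, P1, P2 and the row-major [y][x][d] cost-volume layout are illustrative choices): one block aggregates one horizontal SGM path, threads parallelize over disparities so that the warp-wise read of the cost volume is coalesced, and the loop over pixels along the path is the sequential inner loop, with the previous pixel's path costs kept in shared memory.

// Hypothetical SGM path aggregation kernel; launch with
// aggregate_path<<<height, DMAX>>>(d_C, d_L, width);
#define DMAX 64    // disparity range = block size (assumption)
#define P1   7     // illustrative penalty values
#define P2   100

__global__ void aggregate_path(const unsigned char *C,  // cost volume [y][x][d]
                               unsigned short *L,        // path costs, same layout
                               int width)
{
    __shared__ unsigned short Lprev[DMAX];  // previous pixel's path costs
    int d = threadIdx.x;                    // one thread per disparity
    int y = blockIdx.x;                     // one block per image row

    // First pixel of the path: L(x = 0, d) = C(x = 0, d).
    Lprev[d] = C[(y * width) * DMAX + d];
    L[(y * width) * DMAX + d] = Lprev[d];
    __syncthreads();

    for (int x = 1; x < width; ++x) {       // sequential inner loop along the path
        // Minimum over all disparities of the previous pixel; computed
        // redundantly per thread here, a parallel reduction in practice.
        int minPrev = Lprev[0];
        for (int k = 1; k < DMAX; ++k)
            minPrev = min(minPrev, (int)Lprev[k]);

        // SGM recursion; the read of C is coalesced across the warp.
        int cost = C[(y * width + x) * DMAX + d];
        int best = Lprev[d];
        if (d > 0)        best = min(best, Lprev[d - 1] + P1);
        if (d < DMAX - 1) best = min(best, Lprev[d + 1] + P1);
        best = min(best, minPrev + P2);

        int Lcur = cost + best - minPrev;
        __syncthreads();                    // all reads of Lprev are done
        Lprev[d] = (unsigned short)Lcur;
        L[(y * width + x) * DMAX + d] = (unsigned short)Lcur;
        __syncthreads();                    // all writes visible before next pixel
    }
}

The two barriers per pixel are the price of keeping the recursion's data in shared memory: each thread reads its neighbors' previous path costs, so writes for pixel x must not begin before all reads for pixel x have finished.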
 
 