Architectures for Stereo Vision - Signal Processing Systems - page 492

Fig. 12 Impact of the parallelization configuration on the performance of the concurrent path cost calculation for eight paths of the SGM for 1280 × 960 images and 128 disparity levels. Block width and tile width are both fixed to tdx = 32. Best performance is achieved with tdx × tdy = 32 × 4 (i.e. each inner loop processes 32 disparity levels) and t_y = 16

of keeping the processing implementation unchanged but rearranging the data in memory creates an inherently contradictory situation: if the GPU is used to rearrange the data, the re-sorting causes additional memory accesses which are not even coalesced.

Again, parameter adjustment allows navigating between the performance optimization principles. The first parameter (tdy) trades thread parallelism against sequential computation in the inner loop for all kernels. The second parameter (t_y) trades the number of blocks that can be processed in parallel against launch overhead and memory overhead for the four diagonal paths. Figure 12 shows the result of the parameter study. Choosing tdy = 4 and t_y = 16 results in the best performance (39.8 ms and 39.7 GB/s) for a 1280 × 960 image. If concurrent kernel execution is not used, performance is approximately halved (75.7 ms and 20.9 GB/s). Both kernel sets, concurrent and sequential, are latency bound.
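The relation between tdy, the per-thread workload and the reported timings can be checked with a little arithmetic. The following is a minimal sketch, not the original implementation: the helper name `inner_loop_levels` is hypothetical, and it assumes that the disparity range is split evenly across the tdy threads and that a thread block holds tdx × tdy threads.

```python
def inner_loop_levels(num_disparities, tdy):
    """Disparity levels each inner loop handles sequentially.

    Assumption: the disparity range is divided evenly among tdy threads.
    """
    if num_disparities % tdy != 0:
        raise ValueError("tdy must divide the number of disparity levels")
    return num_disparities // tdy


tdx, tdy = 32, 4                      # best configuration from the parameter study
threads_per_block = tdx * tdy         # assumed block size: 32 * 4 threads
levels = inner_loop_levels(128, tdy)  # 128 disparity levels in total

# Timings quoted in the text for a 1280 x 960 image
t_concurrent_ms = 39.8
t_sequential_ms = 75.7
speedup = t_sequential_ms / t_concurrent_ms  # gain from concurrent kernel execution
```

With these numbers, each inner loop covers 32 disparity levels, and concurrent kernel execution is roughly 1.9 times faster than sequential execution, which matches the "approximately halved" statement.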

Summation of the eight path cost spaces (7) and winner-takes-all disparity selection (9) can be performed independently for each pixel, allowing for the same parallelization scheme as for the MC calculation. This kernel (sum wta) requires 15.1 ms and is memory bound with 117.4 GB/s.
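The per-pixel independence of this step can be illustrated with a short sequential sketch. It is an illustration only, not the GPU kernel: the function name, the nested-list data layout, and the tie-breaking toward the lowest disparity are assumptions, and winner-takes-all is taken here to mean choosing the disparity with the minimum summed cost.

```python
def sum_wta(path_costs):
    """Sum path costs and select the winning disparity per pixel.

    path_costs[r][p][d] is the cost of path r at pixel p for disparity d
    (hypothetical layout chosen for readability).
    """
    num_pixels = len(path_costs[0])
    num_disp = len(path_costs[0][0])
    disparities = []
    for p in range(num_pixels):  # every pixel is independent of all others
        # Sum the costs of all paths for each disparity level
        summed = [sum(path[p][d] for path in path_costs) for d in range(num_disp)]
        # Winner-takes-all: index of the minimum summed cost
        best = min(range(num_disp), key=summed.__getitem__)
        disparities.append(best)
    return disparities


# Tiny example: two paths, two pixels, three disparity levels
costs = [
    [[5, 1, 4], [2, 7, 3]],
    [[6, 2, 1], [1, 8, 4]],
]
result = sum_wta(costs)
```

Because the loop body for one pixel never reads data belonging to another pixel, each iteration could be mapped to its own GPU thread.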

3.6.4 Performance

The processing time for the complete disparity estimation, including rank transform, semi-global matching for eight paths, disparity map generation (without left/right check (3)) and median filtering, on a Tesla C2050 Fermi architecture GPU is summarized in Table 2. Overall, a 1280 × 960 image with 128 disparity levels requires 56.2 ms. The processing times do not include data transfer between host and GPU because it can be effectively hidden using concurrent data transfer when processing image streams. When processing 1280 × 960 image sets, ca. 5 ms additional transfer time is required.
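These figures translate into frame rates as follows. The sketch assumes, for illustration only, that a transfer that is not overlapped with computation simply adds to the processing time; the helper name is hypothetical.

```python
def frames_per_second(time_ms):
    """Convert a per-frame time in milliseconds to a frame rate."""
    return 1000.0 / time_ms


processing_ms = 56.2  # complete disparity estimation, from the text
transfer_ms = 5.0     # approximate additional host/GPU transfer time

fps_hidden = frames_per_second(processing_ms)                 # transfer overlapped
fps_exposed = frames_per_second(processing_ms + transfer_ms)  # transfer not overlapped
```

Under these assumptions the pipeline reaches roughly 17.8 frames per second when the transfer is hidden and roughly 16.3 frames per second when it is not.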
