Digital Signal Processing Reference
Fig. 12 Impact of the parallelization configuration on the performance of the concurrent path cost
calculation for eight paths of the SGM for 1280 × 960 images and 128 disparity levels. Block width
and tile width are both fixed to tdx = 32. Best performance is achieved with tdy = 4 (i.e.,
each inner loop processes 32 disparity levels) and ty = 16
of keeping the processing implementation unchanged but rearranging the data in
memory creates an inherently contradictory situation: if the GPU is used to
rearrange the data, the re-sorting causes additional memory access which is not even
Again, adjusting the parameters allows navigating between the performance
optimization principles. The first parameter (tdy) trades thread parallelism against
sequential computation in the inner loop for all kernels. The second parameter
(ty) trades the number of blocks that can be processed in parallel against launch
overhead and memory overhead for the four diagonal paths. Figure 12 shows the result
of the parameter study. Choosing tdy = 4 and ty = 16 results in the best performance
(about 39 ms and 39 GB/s) for a 1280 × 960 image. If concurrent kernel execution is
not used, performance is approximately halved (75.7 ms and 20 GB/s). Both kernel
sets, concurrent and sequential, are latency bound.
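The trade-off controlled by tdy can be made concrete with a little arithmetic. The
sketch below assumes that the 128 disparity levels are divided evenly among tdy
thread rows, which matches the caption's statement that tdy = 4 gives 32 levels per
inner loop. The function name and the block shape of tdx × tdy threads are
illustrative assumptions, not taken from the original implementation.

```python
def inner_loop_levels(num_disparities: int, tdy: int) -> int:
    """Disparity levels each thread handles sequentially in its inner loop."""
    if num_disparities % tdy != 0:
        raise ValueError("tdy must divide the number of disparity levels")
    return num_disparities // tdy


NUM_DISPARITIES = 128  # from the text
TDX = 32               # block width, fixed in the parameter study

# A larger tdy means more threads per block but a shorter inner loop.
for tdy in (1, 2, 4, 8):
    levels = inner_loop_levels(NUM_DISPARITIES, tdy)
    threads = TDX * tdy  # assumption: a block holds tdx x tdy threads
    print(f"tdy={tdy}: {threads} threads per block, {levels} levels per inner loop")
```

Under these assumptions, tdy = 4 sits between the extremes of one long sequential
loop per thread (tdy = 1) and very short loops spread over many threads.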
Summation of the eight path cost spaces (7) and winner-takes-all disparity
selection (9) can be performed independently for each pixel, allowing for the same
parallelization scheme as for the MC calculation. This kernel (sum wta) requires
1 ms and is memory bound with 117 GB/s.
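The per-pixel independence of this step can be illustrated with a small CPU-side
sketch. It assumes that (7) is a plain sum of the path costs over all paths and that
(9) selects the disparity with the minimum summed cost; the function names and the
nested-list data layout are hypothetical and chosen only for readability, not for
performance.

```python
def sum_wta_pixel(costs_per_path):
    """Sum one pixel's path costs over all paths and pick the cheapest disparity.

    costs_per_path: one list of per-disparity costs for each path.
    Returns (best_disparity, summed_costs).
    """
    num_disparities = len(costs_per_path[0])
    summed = [sum(path[d] for path in costs_per_path) for d in range(num_disparities)]
    # Winner-takes-all: the disparity with the minimum summed cost wins
    # (ties resolve to the lowest disparity index).
    best = min(range(num_disparities), key=summed.__getitem__)
    return best, summed


def sum_wta_image(path_costs):
    """path_costs[p][y][x][d] -> disparity map [y][x]; every pixel is independent."""
    height, width = len(path_costs[0]), len(path_costs[0][0])
    return [
        [sum_wta_pixel([path[y][x] for path in path_costs])[0] for x in range(width)]
        for y in range(height)
    ]


# Toy input: 2 paths, an image of 1 row x 2 pixels, 3 disparity levels.
path_a = [[[5, 1, 4], [2, 6, 3]]]
path_b = [[[3, 2, 0], [1, 1, 5]]]
disparity_map = sum_wta_image([path_a, path_b])
print(disparity_map)
```

Because no pixel reads another pixel's result, each iteration of the two loops in
sum_wta_image could be assigned to its own GPU thread.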
The processing time for the complete disparity estimation including rank transform,
semi-global matching for eight paths, disparity map generation (without left/right
check (3)) and median filtering on a Tesla C2050 Fermi architecture GPU is
summarized in Table 2. Overall, a 1280 × 960 image with 128 disparity levels
requires 56.2 ms. The processing times do not include data transfer between host
and GPU because it can be effectively hidden using concurrent data transfer
when processing image streams. When processing 1280 × 960 image sets, ca. 5 ms
of additional transfer time is required.
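To put these timings in perspective, the following sketch converts them into frame
rates. It uses only the figures quoted above (56.2 ms of processing and ca. 5 ms of
transfer); the helper name and the simple model in which non-hidden transfer time is
added to the processing time are assumptions made for illustration.

```python
WIDTH, HEIGHT, NUM_DISPARITIES = 1280, 960, 128
COMPUTE_MS = 56.2   # complete disparity estimation, from the text
TRANSFER_MS = 5.0   # approximate host/GPU transfer time, from the text


def frames_per_second(frame_time_ms: float) -> float:
    """Frame rate that corresponds to a given time per frame in milliseconds."""
    return 1000.0 / frame_time_ms


hidden = frames_per_second(COMPUTE_MS)                 # transfer overlapped with compute
exposed = frames_per_second(COMPUTE_MS + TRANSFER_MS)  # transfer added to compute
evaluations = WIDTH * HEIGHT * NUM_DISPARITIES         # pixel/disparity pairs per frame

print(f"{hidden:.1f} fps with hidden transfer")
print(f"{exposed:.1f} fps with serialized transfer")
print(f"{evaluations / (COMPUTE_MS / 1000.0) / 1e9:.2f} G disparity evaluations per second")
```

Under this model, hiding the transfer corresponds to roughly 17.8 frames per second,
compared with about 16.3 frames per second when the transfer is not overlapped.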