Digital Signal Processing Reference
Fig. 12 Impact of the parallelization configuration on the performance of the concurrent path cost
calculation for eight paths of the SGM for 1280 × 960 images and 128 disparity levels. Block width
and tile width are both fixed to tdx = 32. Best performance is achieved with tdx × tdy = 32 × 4
(i.e. each inner loop processes 32 disparity levels) and t_y = 16

of keeping the processing implementation unchanged but rearranging the data in
memory creates an inherently contradictory situation: if the GPU is used to
rearrange the data, the re-sorting causes additional memory accesses, which are
not even coalesced.
Again, adjusting the parameters makes it possible to navigate between the
performance optimization principles. The first parameter (tdy) trades thread
parallelism against sequential computation in the inner loop for all kernels.
The second parameter (t_y) trades the number of blocks that can be processed in
parallel against launch overhead and memory overhead for the four diagonal
paths. Figure 12 shows the result of the parameter study. Choosing tdy = 4 and
t_y = 16 results in the best performance (39.8 ms and 39.7 GB/s) for a
1280 × 960 image. If concurrent kernel execution is not used, performance is
approximately halved (75.7 ms and 20.9 GB/s). Both kernel sets, concurrent and
sequential, are latency bound.
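
The trade-off behind tdy and the quoted timings can be checked with a little arithmetic. The following is a minimal sketch, not the authors' code. It assumes that the disparity range is split evenly across the tdy dimension of a tdx × tdy thread block (which matches the caption's "32 × 4, i.e. each inner loop processes 32 disparity levels"), and it reads the measurements as 39.8 ms at 39.7 GB/s (concurrent) and 75.7 ms at 20.9 GB/s (sequential). All function names are hypothetical.

```python
# Values taken from the text: 128 disparity levels, tdx fixed to 32.
DISPARITY_LEVELS = 128
TDX = 32


def inner_loop_levels(tdy, disparity_levels=DISPARITY_LEVELS):
    """Disparity levels one thread handles sequentially in its inner loop.

    Assumption: the disparity range is divided evenly by tdy.
    """
    if disparity_levels % tdy != 0:
        raise ValueError("tdy is assumed to divide the disparity range evenly")
    return disparity_levels // tdy


def threads_per_block(tdy, tdx=TDX):
    """Number of threads in one tdx x tdy block."""
    return tdx * tdy


def data_volume_gb(time_ms, bandwidth_gb_per_s):
    """Data volume in GB implied by a runtime and an achieved bandwidth."""
    return time_ms * 1e-3 * bandwidth_gb_per_s


if __name__ == "__main__":
    # Larger tdy: more threads per block, shorter sequential inner loop.
    for tdy in (1, 2, 4, 8):
        print(tdy, threads_per_block(tdy), inner_loop_levels(tdy))

    concurrent = data_volume_gb(39.8, 39.7)
    sequential = data_volume_gb(75.7, 20.9)
    print(round(concurrent, 2), round(sequential, 2), round(75.7 / 39.8, 2))
```

Under these readings both kernel sets move about 1.58 GB, and the sequential variant takes about 1.9 times as long, which agrees with the statement that performance is approximately halved.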
Summation of the eight path cost spaces (7) and winner-takes-all disparity
selection (9) can be performed independently for each pixel, allowing for the
same parallelization scheme as for the MC calculation. This kernel (sum wta)
requires 15.1 ms and is memory bound with 117.4 GB/s.
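
The per-pixel independence of this step can be illustrated with a sequential reference. This is a minimal sketch in plain Python, not the GPU kernel itself. The nested-list layout `path_costs[p][y][x][d]` and the function name `sum_wta` are assumptions made for readability, and ties are resolved in favour of the smallest disparity, which the text does not specify.

```python
def sum_wta(path_costs):
    """Sum the per-path costs and select the minimum-cost disparity per pixel.

    path_costs[p][y][x][d] is the cost of path p at pixel (x, y) for
    disparity d (hypothetical layout). Returns disparity[y][x].
    """
    height = len(path_costs[0])
    width = len(path_costs[0][0])
    levels = len(path_costs[0][0][0])

    disparity = [[0] * width for _ in range(height)]
    # No pixel depends on any other pixel, so on a GPU the two outer
    # loops can be mapped directly to threads.
    for y in range(height):
        for x in range(width):
            best_d, best_cost = 0, None
            for d in range(levels):
                # Summation over all path cost spaces for this (x, y, d).
                total = sum(costs[y][x][d] for costs in path_costs)
                # Winner-takes-all: keep the disparity with the lowest sum.
                if best_cost is None or total < best_cost:
                    best_d, best_cost = d, total
            disparity[y][x] = best_d
    return disparity


if __name__ == "__main__":
    # Two paths, a 1 x 2 image and 3 disparity levels (toy data).
    path_a = [[[5, 1, 4], [2, 2, 0]]]
    path_b = [[[0, 3, 1], [0, 3, 4]]]
    print(sum_wta([path_a, path_b]))
```

In the toy data the summed costs are [5, 4, 5] for the first pixel and [2, 5, 4] for the second, so the selected disparities are 1 and 0.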
3.6.4 Performance
The processing time for the complete disparity estimation including rank transform,
semi-global matching for eight paths, disparity map generation (without left/right
check (3)) and median filtering on a Tesla C2050 Fermi architecture GPU is
summarized in Table 2. Overall, a 1280 × 960 image with 128 disparity levels
requires 56.2 ms. The processing times do not include data transfer between host
and GPU because it can be effectively hidden using concurrent data transfer
when processing image streams. When processing 1280 × 960 image sets, ca. 5 ms
additional transfer time is required.
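
To put these figures in terms of throughput, the frame time can be converted to a frame rate. This is a small illustrative calculation only: it reads the total processing time as 56.2 ms, takes the roughly 5 ms transfer time at face value, and uses a hypothetical helper name.

```python
def frames_per_second(frame_time_ms):
    """Frame rate implied by a per-frame time given in milliseconds."""
    if frame_time_ms <= 0:
        raise ValueError("frame time must be positive")
    return 1000.0 / frame_time_ms


if __name__ == "__main__":
    processing_ms = 56.2   # total processing time, as read from the text
    transfer_ms = 5.0      # approximate extra host/GPU transfer time

    # Transfer hidden behind computation (image streams).
    print(round(frames_per_second(processing_ms), 1))
    # Transfer not hidden.
    print(round(frames_per_second(processing_ms + transfer_ms), 1))
```

With these inputs the result is roughly 17.8 frames per second when the transfer is hidden and roughly 16.3 frames per second when it is not.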
 
 