Digital Signal Processing Reference
Fig. 12 Impact of the parallelization configuration on the performance of the concurrent path cost
calculation for eight paths of the SGM for 1280 × 960 images and 128 disparity levels. Block width
and tile width are both fixed to tdx = 32. Best performance is achieved with tdx × tdy = 32 × 4
(i.e. each inner loop processes 32 disparity levels) and ty = 16.
of keeping the processing implementation unchanged but rearranging the data in
memory creates an inherently contradictory situation: if the GPU is used to
rearrange the data, the re-sorting causes additional memory accesses, which are not
even coalesced.
Again, adjusting parameters makes it possible to navigate between the performance
optimization principles. The first parameter (tdy) trades thread parallelism against
sequential computation in the inner loop for all kernels. The second parameter (ty)
trades the number of blocks that can be processed in parallel against launch overhead
and memory overhead for the four diagonal paths. Figure 12 shows the result of the
parameter study. Choosing tdy = 4 and ty = 16 results in the best performance
(39.8 ms and 39.7 GB/s) for a 1280 × 960 image. If concurrent kernel execution is not
used, performance is approximately halved (75.7 ms and 20.9 GB/s). Both kernel
sets, concurrent and sequential, are latency bound.
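The effect of the first parameter can be illustrated with a little arithmetic. The following is a minimal sketch in Python, not the GPU implementation itself. It assumes that tdx × tdy is the two-dimensional thread block size and that the 128 disparity levels are split evenly over the tdy rows of a block, which is how the caption of Fig. 12 reads. The function names block_config and speedup are hypothetical, and ty is not modeled because its exact definition is not given in this excerpt.

```python
# Values quoted in the text and in the caption of Fig. 12.
DISPARITY_LEVELS = 128
TDX = 32  # block width and tile width (fixed)


def block_config(tdy, disparity_levels=DISPARITY_LEVELS, tdx=TDX):
    """Derive per-block quantities from the first parameter tdy.

    Assumption: a block holds tdx * tdy threads and each thread loops
    sequentially over disparity_levels / tdy disparity levels.
    """
    if disparity_levels % tdy != 0:
        raise ValueError("tdy must divide the number of disparity levels")
    return {
        "threads_per_block": tdx * tdy,
        "levels_per_inner_loop": disparity_levels // tdy,
    }


def speedup(t_sequential_ms, t_concurrent_ms):
    """Ratio of sequential to concurrent runtime."""
    return t_sequential_ms / t_concurrent_ms


# Best configuration reported in the text: tdy = 4.
best = block_config(tdy=4)
print(best)  # {'threads_per_block': 128, 'levels_per_inner_loop': 32}

# More threads per block means fewer sequential iterations per thread.
for tdy in (1, 2, 4, 8, 16):
    print(tdy, block_config(tdy))

# Concurrent versus sequential kernel execution, using the quoted runtimes.
gain = speedup(75.7, 39.8)
print(round(gain, 2))  # 1.9
```

The runtime ratio of about 1.9 matches the ratio of the quoted bandwidths (39.7 GB/s versus 20.9 GB/s), which is what "approximately halved" refers to.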
parallelization scheme as for the MC calculation. This kernel (sum wta) requires
15.1 ms and is memory bound with 117.4 GB/s.
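The runtime and bandwidth quoted for the sum wta kernel imply a certain amount of memory traffic. The sketch below only multiplies the two figures and relates the result to the size of the cost volume. It assumes that GB means 10^9 bytes and that the quoted bandwidth is the effective throughput over the whole kernel runtime. The helper bytes_moved is a hypothetical name, and the sketch makes no claim about how the traffic splits into reads and writes.

```python
GB = 1e9  # assumption: decimal gigabytes


def bytes_moved(time_ms, bandwidth_gb_s):
    """Memory traffic implied by a runtime and an effective bandwidth."""
    return time_ms * 1e-3 * bandwidth_gb_s * GB


# Image size and disparity range used throughout the section.
WIDTH, HEIGHT, DISPARITIES = 1280, 960, 128
cost_entries = WIDTH * HEIGHT * DISPARITIES  # one entry per pixel and disparity

traffic = bytes_moved(15.1, 117.4)
print(f"{traffic / GB:.2f} GB")  # 1.77 GB
print(f"{traffic / cost_entries:.1f} bytes per cost entry")  # 11.3 bytes per cost entry
```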
3.6.4 Performance
The processing time for the complete disparity estimation, including rank transform,
semi-global matching for eight paths, and disparity map generation (without left/right
check), for a 1280 × 960 image with 128 disparity levels is 56.2 ms. The processing
times do not include data transfer between host and GPU because it can be effectively
hidden using concurrent data transfer when processing image streams. When processing
1280 × 960 image sets, ca. 5 ms of additional transfer time is required.
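These timings translate into frame rates as follows. This is a small sketch that uses only the figures quoted in this section. It assumes that the kernels run back to back, so that the total is the sum of the stage times, which the excerpt does not state explicitly. The time left after the path cost calculation and the sum wta kernel is therefore only an estimate for the remaining stages. The function frames_per_second is a hypothetical helper.

```python
def frames_per_second(time_ms):
    """Frame rate corresponding to a per-frame processing time in ms."""
    return 1000.0 / time_ms


TOTAL_MS = 56.2       # complete disparity estimation
PATH_COSTS_MS = 39.8  # concurrent path cost calculation, eight paths
SUM_WTA_MS = 15.1     # sum wta kernel
TRANSFER_MS = 5.0     # "ca. 5 ms" host/GPU transfer when it is not hidden

# Assumption: stage times simply add up to the total.
remaining_ms = TOTAL_MS - PATH_COSTS_MS - SUM_WTA_MS
print(round(remaining_ms, 1))  # 1.3

# Transfer hidden behind computation (image streams).
print(round(frames_per_second(TOTAL_MS), 1))  # 17.8

# Transfer not hidden.
print(round(frames_per_second(TOTAL_MS + TRANSFER_MS), 1))  # 16.3
```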