Digital Signal Processing Reference
In-Depth Information
Fig. 8 Performance of the 3 × 3 median filter on 1280 × 960 images as the parallelization configuration changes. Block width is fixed to tdx = 32. Best performance is achieved with tdx × tdy = 32 × 4 and n_ppt = 4
Fig. 9 Performance of the 3 × 3 median filter: comparison of the texture memory kernel and the proposed shared memory kernel on a Tesla C2050 GPU for the best-performing parallelization configuration
The median filter is always compute bound and performs best with tdx × tdy = 32 × 4 threads and n_ppt = 4. The results of the parameter study for tdx = 32 are shown in Fig. 8. Configurations with n_ppt = 8 perform slightly worse: although redundant memory access is further reduced, the pipeline is utilized less efficiently. Processing times for a 3 × 3 median filter (i.e. kernel radius K = 1) are given in Fig. 9. The new shared-memory-based kernel takes 0.64 ms, whereas a texture-memory-based kernel, which is the most often suggested way of implementing a 2D non-separable filter, takes 2.77 ms. This yields a speed-up of 4.3 when processing a 1280 × 960 image.
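The trade-off between n_ppt = 4 and n_ppt = 8 can be illustrated with a little tile arithmetic. The following is a minimal sketch, not the authors' kernel: it assumes that each thread produces n_ppt output pixels stacked along y (the text does not say in which direction), that a block therefore covers tdx × (tdy · n_ppt) pixels, and that the block loads this tile plus a halo of K pixels on every side into shared memory. The function name tile_stats and its return fields are hypothetical.

```python
import math


def tile_stats(tdx, tdy, n_ppt, radius, width, height):
    """Shared-memory tile geometry for one thread block.

    Assumption (not stated in the text): each thread produces n_ppt
    output pixels stacked along y, so one block covers
    tdx x (tdy * n_ppt) output pixels.
    """
    tile_w = tdx
    tile_h = tdy * n_ppt
    # The tile plus a halo of `radius` pixels on every side is loaded.
    loaded = (tile_w + 2 * radius) * (tile_h + 2 * radius)
    outputs = tile_w * tile_h
    # Number of blocks needed to cover the whole image.
    blocks = math.ceil(width / tile_w) * math.ceil(height / tile_h)
    return {
        "tile": (tile_w, tile_h),
        "blocks": blocks,
        "loads_per_output": loaded / outputs,
    }


# 3 x 3 median filter (K = 1) on a 1280 x 960 image.
best = tile_stats(32, 4, 4, radius=1, width=1280, height=960)
larger = tile_stats(32, 4, 8, radius=1, width=1280, height=960)
print(best)
print(larger)
```

Under these assumptions, going from n_ppt = 4 to n_ppt = 8 lowers the number of pixels loaded per output pixel, which matches the statement that redundant memory access is further reduced. The model says nothing about pipeline utilization, so it cannot by itself explain why the larger configuration is slower in the measurements.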
For a 9 × 9 rank transform (i.e. K = 4), experiments showed that a block size of tdx × tdy = 32 × 4 with n_ppt = 4 yields the best performance. A speed-up of 4.0 is obtained by switching from the texture-based kernel (3.13 ms) to the shared memory kernel (0.78 ms) for 1280 × 960 images.
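For readers unfamiliar with the rank transform, a plain CPU reference may help clarify what the kernels compute: each output value counts the pixels in the (2K + 1) × (2K + 1) window that are strictly darker than the center pixel. This is a minimal sketch of that definition only, not the shared memory kernel discussed here. The function name rank_transform is hypothetical, and clamping coordinates at the image border is an assumption, since the text does not describe border handling.

```python
def rank_transform(image, radius):
    """Rank transform with a (2 * radius + 1)^2 window.

    Each output value counts the window pixels that are strictly darker
    than the center pixel. Border handling (clamping coordinates to the
    image) is an assumption of this sketch.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            center = image[y][x]
            rank = 0
            for dy in range(-radius, radius + 1):
                yy = min(max(y + dy, 0), height - 1)  # clamp row index
                for dx in range(-radius, radius + 1):
                    xx = min(max(x + dx, 0), width - 1)  # clamp column index
                    if image[yy][xx] < center:
                        rank += 1
            out[y][x] = rank
    return out


# 3 x 3 ramp image: values 0..8 in row-major order.
ramp = [[x + 3 * y for x in range(3)] for y in range(3)]
ranks = rank_transform(ramp, radius=1)
print(ranks[1][1])  # four pixels (0, 1, 2, 3) are darker than the center value 4
```

For the 9 × 9 case in the text, the call would use radius=4, and each output value lies between 0 and 80.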
 
 