Digital Signal Processing Reference
Fig. 8 Performance of the 3 × 3 median filter on 1280 × 960 images as the parallelization configuration changes. Block width is fixed to tdx = 32. Best performance is achieved with tdx × tdy = 32 × 4 and n_ppt = 4
Fig. 9 Performance of the 3 × 3 median filter: comparison of the texture memory kernel and the proposed shared memory kernel on a Tesla C2050 GPU for the best-performing parallelization configuration
The median filter is always compute bound and performs best with tdx × tdy = 32 × 4 threads and n_ppt = 4. The results of the parameter study for tdx = 32 are shown in Fig. 8. [...] 8 perform slightly worse because of inefficient pipeline utilization, although redundant memory access is further reduced. Processing times for a 3 × 3 median filter (i.e. kernel radius K = 1) are given in Fig. 9: the new shared memory based kernel takes 0.64 ms. A texture-memory based kernel, which is the most often suggested way of implementing a 2D non-separable filter, takes 2.77 ms. The shared memory kernel thus achieves a speed-up of 4.3 when processing a 1280 × 960 image.
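The trade-off between redundant memory access and block configuration can be illustrated with a small back-of-the-envelope model. The sketch below is not the authors' kernel. It assumes that n_ppt is the number of pixels each thread processes, that these pixels are stacked vertically (so a block produces a tile of tdx × (tdy · n_ppt) pixels), and that the block loads this tile plus an apron of K pixels on every side. The function names `grid_dims`, `tile_load_ratio` and `speedup` are hypothetical helpers introduced for illustration only.

```python
import math


def grid_dims(width, height, tdx, tdy, n_ppt):
    """Number of blocks needed to cover the image.

    Assumption: each thread handles n_ppt vertically stacked pixels,
    so one block produces a tile of tdx x (tdy * n_ppt) output pixels.
    """
    tile_w = tdx
    tile_h = tdy * n_ppt
    return math.ceil(width / tile_w), math.ceil(height / tile_h)


def tile_load_ratio(tdx, tdy, n_ppt, K):
    """Pixels loaded per output pixel for one block.

    Assumption: the block loads its output tile plus an apron of K
    pixels on every side. A ratio of 1.0 would mean no redundant loads.
    """
    tile_w = tdx
    tile_h = tdy * n_ppt
    loaded = (tile_w + 2 * K) * (tile_h + 2 * K)
    return loaded / (tile_w * tile_h)


def speedup(t_baseline_ms, t_new_ms):
    """Speed-up of the new kernel over the baseline kernel."""
    return t_baseline_ms / t_new_ms


if __name__ == "__main__":
    # Best-performing configuration reported in the text: 32 x 4 threads, n_ppt = 4.
    print(grid_dims(1280, 960, 32, 4, 4))
    print(tile_load_ratio(32, 4, 4, K=1))
    # Illustrative comparison: more pixels per thread lowers the load ratio.
    print(tile_load_ratio(32, 4, 8, K=1))
    # Timings quoted in the text for the 3 x 3 median filter.
    print(round(speedup(2.77, 0.64), 1))
```

Under these assumptions, the 32 × 4 configuration with n_ppt = 4 and K = 1 loads 612 pixels for 512 outputs (a ratio of about 1.20), while doubling the pixels per thread lowers this to 1156/1024 (about 1.13). This matches the observation that fewer redundant loads do not by themselves guarantee a faster kernel: the model ignores pipeline utilization entirely.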
For a 9 × 9 rank transform (i.e. K = 4), experiments showed that a block size of tdx × tdy = 32 × 4 with n_ppt = 4 yields the best performance. A speed-up of 4.0 is obtained by switching from the texture-based kernel (3.13 ms) to the shared memory kernel (0.78 ms) for 1280 × 960 images.
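For readers unfamiliar with the two filters benchmarked here, the following is a minimal CPU reference sketch in plain Python, useful for checking a GPU kernel's output on small images. It is not the shared memory kernel described in the text. It uses the usual definitions (the median of the (2K+1) × (2K+1) window, and for the rank transform the number of window pixels darker than the centre pixel). Border handling by clamping coordinates is an assumption, since the excerpt does not say how borders are treated, and the helper names are hypothetical.

```python
def window(img, x, y, K):
    """Collect the (2K+1) x (2K+1) neighbourhood around (x, y).

    Assumption: coordinates outside the image are clamped to the border.
    img is a list of rows, each row a list of intensities.
    """
    h, w = len(img), len(img[0])
    return [
        img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
        for dy in range(-K, K + 1)
        for dx in range(-K, K + 1)
    ]


def median_filter(img, K):
    """Replace each pixel by the median of its window (K = 1 gives 3 x 3)."""
    h, w = len(img), len(img[0])
    mid = ((2 * K + 1) ** 2) // 2  # index of the median in the sorted window
    return [
        [sorted(window(img, x, y, K))[mid] for x in range(w)]
        for y in range(h)
    ]


def rank_transform(img, K):
    """Replace each pixel by the count of window pixels darker than it (K = 4 gives 9 x 9)."""
    h, w = len(img), len(img[0])
    return [
        [sum(v < img[y][x] for v in window(img, x, y, K)) for x in range(w)]
        for y in range(h)
    ]


if __name__ == "__main__":
    # Small test image with one outlier in the centre.
    img = [[1, 2, 3],
           [4, 50, 6],
           [7, 8, 9]]
    print(median_filter(img, 1)[1][1])   # the outlier is replaced by the window median
    print(rank_transform(img, 1)[1][1])  # all eight neighbours are darker than the centre
```

Both filters visit every pixel of the window for every output pixel and cannot be split into a row pass and a column pass, which is why neighbouring threads re-read largely the same data and why on-chip reuse matters for the GPU versions.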