Graphics Reference
In-Depth Information
3000
Texture
Texture unrolled
2500
Shared
Shared unrolled
FFT
2000
1500
1000
500
0
2
4
6
8
10
12
14
16
18
Filter Size
Figure 5.8. Performance, measured in megavoxels per second, for the different imple-
mentations of 3D filtering, for a volume of size 256 × 256 × 256 and filter sizes ranging
from 3 × 3 × 3to17 × 17 × 17. FFT-based filtering is clearly faster than the other
approaches for large filters.
are for fixed filter sizes of 7
×
7
×
7
×
7and11
×
11
×
11
×
11, respectively. The
data sizes range from 128
128 in steps of
eight time points. All plots contain the processing time for spatial convolution
using shared memory (with and without loop unrolling) and FFT-based filtering
using CUFFT. The CUFFT library does not directly support 4D FFTs; it was
performed by running two batches of 2D FFTs and changing the order of the
data between them from ( x , y , z , t )to( z , t , x , y ). A 4D FFT developed by
Nvidia would, however, probably be more ecient.
×
128
×
64
×
16 to 128
×
128
×
64
×
5.10 Conclusions
We have presented solutions for fast non-separable floating point convolution in
2D, 3D, and 4D, using the CUDA programming language. We believe that these
implementations will serve as a complement to the NPP library, which currently
only supports 2D filters and images stored as integers. The shared memory
implementation with loop unrolling is approximately twice as fast as the simple
texture memory implementation, which is similar to results obtained by Nvidia for
separable 2D convolution [Podlozhnyuk 07]. For 3D and 4D data it might seem
strange to use convolution instead of an FFT, but the convolution approach
Search WWH ::




Custom Search