Non-separable 2D, 3D, and 4D Filtering with CUDA - GPU Pro: Advanced Rendering Techniques

Graphics Reference

In-Depth Information

3000

Texture

Texture unrolled

2500

Shared

Shared unrolled

FFT

2000

1500

1000

500

0

2

4

6

8

10

12

14

16

18

Filter Size

Figure 5.8. Performance, measured in megavoxels per second, for the different imple-

mentations of 3D filtering, for a volume of size 256 × 256 × 256 and filter sizes ranging

from 3 × 3 × 3to17 × 17 × 17. FFT-based filtering is clearly faster than the other

approaches for large filters.

are for fixed filter sizes of 7

×

7

×

7

×

7and11

×

11

×

11

×

11, respectively. The

data sizes range from 128

128 in steps of

eight time points. All plots contain the processing time for spatial convolution

using shared memory (with and without loop unrolling) and FFT-based filtering

using CUFFT. The CUFFT library does not directly support 4D FFTs; it was

performed by running two batches of 2D FFTs and changing the order of the

data between them from ( x , y , z , t )to( z , t , x , y ). A 4D FFT developed by

Nvidia would, however, probably be more ecient.

×

128

×

64

×

16 to 128

×

128

×

64

×

5.10 Conclusions

We have presented solutions for fast non-separable floating point convolution in

2D, 3D, and 4D, using the CUDA programming language. We believe that these

implementations will serve as a complement to the NPP library, which currently

only supports 2D filters and images stored as integers. The shared memory

implementation with loop unrolling is approximately twice as fast as the simple

texture memory implementation, which is similar to results obtained by Nvidia for

separable 2D convolution [Podlozhnyuk 07]. For 3D and 4D data it might seem

strange to use convolution instead of an FFT, but the convolution approach

GPU Pro: Advanced Rendering Techniques

Search WWH ::

Custom Search

Home