[Figure: bar chart comparing the three implementations (shared memory, shared memory unrolled, FFT); x-axis: filter size, y-axis: megavoxels per second, 0-800.]

Figure 5.10. Performance, measured in megavoxels per second, for the different implementations of 4D filtering, for a dataset of size 128 × 128 × 128 × 32 and filter sizes ranging from 3 × 3 × 3 × 3 to 17 × 17 × 17 × 17.
can, for example, handle larger datasets. In our work on 4D image denoising
[Eklund et al. 11], the FFT-based approach was on average only three times faster
(compared to about 30 times faster in the benchmarks given here). The main
reason for this was the high-resolution nature of the data (512 × 512 × 445 × 20
elements), making it impossible to load all the data into global memory. Due to
its higher memory consumption, the FFT-based approach was forced to load a
smaller number of slices into global memory compared to the spatial approach.
As only a subset of the slices (and time points) is valid after the filtering, the
FFT-based approach required a larger number of runs to process all the slices.
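The trade-off above is easy to quantify. A minimal sketch (the slice counts per chunk are illustrative assumptions, not numbers from the chapter): if a chunk of S slices is filtered with a filter spanning F slices, only S − F + 1 slices are valid per pass, so an approach that fits fewer slices into global memory needs more passes over the data.

```python
import math

def runs_needed(total_slices, chunk_slices, filter_slices):
    """Number of chunked filtering passes needed to cover all slices,
    given that only chunk_slices - filter_slices + 1 slices per pass
    are valid ('valid' convolution along the chunked dimension)."""
    valid_per_run = chunk_slices - filter_slices + 1
    if valid_per_run <= 0:
        raise ValueError("chunk too small for the filter")
    return math.ceil(total_slices / valid_per_run)

# Hypothetical chunk sizes: suppose the FFT-based approach fits 64
# slices per chunk while the spatial approach fits 128, with an
# 11-slice filter and the 445-slice dataset mentioned above.
print(runs_needed(445, 64, 11))   # FFT-based: 9 passes
print(runs_needed(445, 128, 11))  # spatial: 4 passes
```

Each extra pass re-reads overlapping slices from host memory, which is one way a large benchmark advantage for the FFT approach can shrink on high-resolution data.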
Finally, we close by noting two additional topics that readers may wish to consider for more advanced study. First, applications in which several filters are applied simultaneously to the same data (e.g., six complex-valued quadrature filters to estimate a local structure tensor in 3D) can lead to different conclusions regarding the performance of spatial convolution versus FFT-based filtering. Second, filter networks can be used to speed up spatial convolution by combining the results of many small filter kernels, yielding a proportionally higher gain for 3D and 4D than for 2D convolution [Andersson et al. 99, Svensson et al. 05]. All the code for this chapter is available under GNU GPL 3 at https://github.com/wanderine/NonSeparableFilteringCUDA.
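The dimensionality argument for filter networks can be sketched with a simple operation count. This is a simplified model (real filter networks also exploit kernel sparsity and structure): a cascade of n kernels of size k has effective support n(k − 1) + 1, so four 3^d kernels cover the same support as one dense 9^d kernel, and the saving grows with dimensionality d.

```python
def dense_cost(filter_size, dims):
    """Multiplications per output voxel for one dense,
    non-separable convolution of size filter_size**dims."""
    return filter_size ** dims

def network_cost(small_size, n_stages, dims):
    """Multiplications per output voxel for a cascade of n_stages
    small kernels (simplified filter-network model)."""
    return n_stages * small_size ** dims

# Cover a 9^d support with four 3^d kernels: 4*(3-1)+1 = 9.
for d in (2, 3, 4):
    print(d, dense_cost(9, d), network_cost(3, 4, d))
# d=2:   81 vs  36  (~2.3x)
# d=3:  729 vs 108  (~6.8x)
# d=4: 6561 vs 324  (~20x)
```

The ratio (k/n-th root aside) scales as k^d / (n k_small^d), which is why the gain is modest in 2D but substantial for 3D and 4D convolution.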