have 48 KB of shared memory per MP. If one only considers the number of valid
filter responses generated per thread block, the optimal solution is to use all the
shared memory for a single thread block, since this would waste a minimum of
memory on the “halo” of invalid filter responses at the outer edges. According
to the CUDA programming guide, GPUs with compute capability 3.0 (e.g., the
Nvidia GTX 680) can maximally handle 1024 threads per thread block and 2048
concurrent threads per MP. Using all the shared memory for one thread block would, however, limit the GPU to 50% of its possible computational performance, as only 1024 threads can be used in one thread block. Full occupancy can be achieved by instead dividing the 48 KB of shared memory between two thread blocks. For floating point convolution, 96 × 64 pixel values fit into 24 KB of shared memory. The 1024 threads per thread block are arranged as 32 threads along x and 32 threads along y, to achieve coalesced reads from global memory and to match the number of banks in shared memory (32). Each thread starts by reading six values from global memory into shared memory (96 × 64 / 1024 = 6). For a maximum filter size of 17 × 17 pixels, 80 × 48 valid filter responses can then be calculated from the 96 × 64 values in shared memory, since a halo of size 8 is required on all sides.
All threads start by first calculating two filter responses each, yielding 64 × 32 values. Half of the threads then calculate three additional filter responses each, yielding another 48 × 32 values. Finally, a quarter of the threads are used to calculate the filter responses for the last 16 × 16 pixels. The division of the 80 × 48 values into six blocks is illustrated in Figure 5.5.
part of the code for non-separable 2D convolution using shared memory is given
in Listing 5.2 and the second part is given in Listing 5.3. The device function
that performs the 2D convolution is very similar to the kernel for texture-based
convolution; interested readers are therefore referred to the repository.
If more than one filter is to be applied, e.g., four complex valued quadrature filters oriented along 0, 45, 90, and 135 degrees, all the filter responses can be calculated very efficiently by simply performing several multiplications and additions each time a pixel value has been loaded from shared memory to a register. This results in a better ratio between the number of memory accesses and floating point operations. By reducing the maximum filter size to 9 × 9, the number of valid filter responses increases to 88 × 56, since the halo size shrinks to 4. This will also result in a higher occupancy during the convolution. For the first case, yielding 80 × 48 valid filter responses, the mean occupancy for the six blocks is

(32 · 32 · 2 + 32 · 32 · 2 + 32 · 16 · 2 + 32 · 16 · 2 + 16 · 32 · 2 + 16 · 16 · 2) / (6 · 2048) = 62.5%,

and for the second case, yielding 88 × 56 valid filter responses, the mean occupancy increases to

(32 · 32 · 2 + 32 · 32 · 2 + 32 · 24 · 2 + 32 · 24 · 2 + 24 · 32 · 2 + 24 · 24 · 2) / (6 · 2048) = 80.2%.
For the shared memory implementation, the required number of thread blocks in the x- and y-directions is not calculated by dividing the image width and height by the number of threads in each direction (32). The width and height should instead be divided by the number of valid filter responses generated in each direction.