Non-separable 2D, 3D, and 4D Filtering with CUDA - GPU Pro: Advanced Rendering Techniques - page 478

Figure 5.5. The grid represents 96 × 64 pixels in shared memory. As 32 × 32 threads are used per thread block, each thread needs to read six values from global memory into shared memory. The gray pixels represent the filter kernel and the black pixel represents where the current filter response is saved. A yellow halo needs to be loaded into shared memory to be able to calculate all the filter responses. In this case 80 × 48 valid filter responses are calculated, making it possible to apply at most a filter of size 17 × 17. The 80 × 48 filter responses are calculated in six runs, the first two consisting of 32 × 32 pixels (marked light red and light blue). Half of the threads calculate three additional filter responses in blocks of 32 × 16 or 16 × 32 pixels (marked green, dark blue, and dark red). A quarter of the threads calculate the filter response for a last block of 16 × 16 pixels (marked purple). If the halo is reduced from eight to four pixels, 88 × 56 valid filter responses can instead be calculated as two 32 × 32 blocks, one 24 × 32 block, two 32 × 24 blocks, and one 24 × 24 block. In addition to increasing the number of valid filter responses, such an implementation also increases the mean occupancy during convolution from 62.5% to 80.2%. The only drawback is that the largest filter that can be applied drops from 17 × 17 to 9 × 9.
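
The numbers in the caption can be checked with a little host-side arithmetic. The sketch below assumes that every run occupies the full 32 × 32 thread block and that mean occupancy is the fraction of thread slots doing useful work over all runs. The helper names (validResponses, maxFilterSize, runCount, meanOccupancy) are hypothetical and not taken from the chapter.

```cpp
#include <cassert>

// Threads per block side, as stated in the caption (32 x 32 threads).
constexpr int kBlockDim = 32;

// Valid filter responses along one direction: the shared-memory extent
// minus a halo on both sides.
int validResponses(int sharedDim, int halo) {
    assert(sharedDim > 2 * halo);
    return sharedDim - 2 * halo;
}

// Largest odd filter width that fits inside a halo of the given size.
int maxFilterSize(int halo) {
    return 2 * halo + 1;
}

// Number of runs needed when each run covers at most 32 x 32 responses.
int runCount(int validX, int validY) {
    const int runsX = (validX + kBlockDim - 1) / kBlockDim;
    const int runsY = (validY + kBlockDim - 1) / kBlockDim;
    return runsX * runsY;
}

// Mean occupancy: useful responses divided by the thread slots spent
// (every run is assumed to occupy all 32 x 32 threads).
double meanOccupancy(int validX, int validY) {
    const int slots = runCount(validX, validY) * kBlockDim * kBlockDim;
    return static_cast<double>(validX * validY) / slots;
}
```

Under these assumptions an eight-pixel halo gives 80 × 48 responses in six runs with 3840 of 6144 thread slots used (62.5%), while a four-pixel halo gives 88 × 56 responses with 4928 of 6144 slots used (about 80.2%), matching the figures quoted in the caption.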

direction, since each thread block generates more than one valid filter response per thread. The calculation of the x- and y-indices inside the kernel also needs to be changed from the conventional

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

to

    int x = blockIdx.x * VALID_RESPONSES_X + threadIdx.x;
    int y = blockIdx.y * VALID_RESPONSES_Y + threadIdx.y;
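
The effect of the changed stride can be illustrated on the host. The following is a minimal sketch that mimics the index calculation with plain functions. It assumes VALID_RESPONSES_X = 80 and VALID_RESPONSES_Y = 48 as in Figure 5.5. The function names and the ceiling-division grid size are illustrative assumptions, not the chapter's launch code.

```cpp
#include <cassert>

// Assumed values, taken from the eight-pixel-halo layout in Figure 5.5.
constexpr int VALID_RESPONSES_X = 80;
constexpr int VALID_RESPONSES_Y = 48;

// Host-side stand-in for
//   int x = blockIdx.x * VALID_RESPONSES_X + threadIdx.x;
// Consecutive blocks start VALID_RESPONSES_X pixels apart, not blockDim.x.
int pixelX(int blockIdxX, int threadIdxX) {
    return blockIdxX * VALID_RESPONSES_X + threadIdxX;
}

// Host-side stand-in for
//   int y = blockIdx.y * VALID_RESPONSES_Y + threadIdx.y;
int pixelY(int blockIdxY, int threadIdxY) {
    return blockIdxY * VALID_RESPONSES_Y + threadIdxY;
}

// Hypothetical grid-size helper: number of blocks along one direction when
// each block yields validResponses outputs (ceiling division, so a partial
// block covers any remainder at the image border).
int blocksNeeded(int imageSize, int validResponses) {
    assert(imageSize > 0 && validResponses > 0);
    return (imageSize + validResponses - 1) / validResponses;
}
```

For a 512 × 512 image, for example, this gives 7 × 11 thread blocks, and thread 0 of block 1 starts at x = 80 rather than at x = 32 as it would with the conventional blockDim.x stride.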
