Figure 5.5. The grid represents 96 × 64 pixels in shared memory. As 32 × 32 threads are
used per thread block, each thread needs to read six values from global memory into
shared memory. The gray pixels represent the filter kernel and the black pixel represents
where the current filter response is saved. A yellow halo needs to be loaded into shared
memory to be able to calculate all the filter responses. In this case 80 × 48 valid filter
responses are calculated, making it possible to apply at most a filter of size 17 × 17.
The 80 × 48 filter responses are calculated in six runs, the first two consisting of 32 × 32
pixels (marked light red and light blue). Half of the threads calculate three additional
filter responses in blocks of 32 × 16 or 16 × 32 pixels (marked green, dark blue, and dark
red). A quarter of the threads calculate the filter response for a last block of 16 × 16
pixels (marked purple). If the halo is reduced from eight to four pixels, 88 × 56 valid
filter responses can instead be calculated as two 32 × 32 blocks, one 24 × 32 block, two
32 × 24 blocks, and one 24 × 24 block. In addition to increasing the number of valid
filter responses, such an implementation will also increase the mean occupancy during
convolution from 62.5% to 80.2%. The only drawback is that the largest filter that can
be applied drops from 17 × 17 to 9 × 9.
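The numbers in the caption can be checked with a few lines of arithmetic. The sketch below is plain host-side C++, so no GPU is needed. The names TileStats and tileStats are hypothetical helpers, not part of the original text, and mean occupancy is assumed to be the number of valid filter responses divided by the number of thread slots summed over all runs.

```cpp
#include <cassert>

// Hypothetical helper, not from the original text: summarizes one
// shared-memory tile for a given square thread block and halo width.
struct TileStats {
    int validX;           // valid filter responses per tile along x
    int validY;           // valid filter responses per tile along y
    int runs;             // block-sized runs needed to cover the valid region
    int maxFilterSize;    // largest square filter the halo can support
    double meanOccupancy; // assumed: valid responses / (runs * threads per block)
};

TileStats tileStats(int tileW, int tileH, int blockDim, int halo) {
    assert(blockDim > 0 && halo >= 0);
    assert(2 * halo < tileW && 2 * halo < tileH);

    TileStats s{};
    // The halo is lost on both sides of the tile.
    s.validX = tileW - 2 * halo;
    s.validY = tileH - 2 * halo;
    // A halo of h pixels supports a (2h + 1) x (2h + 1) filter.
    s.maxFilterSize = 2 * halo + 1;

    // Number of runs in each direction, rounding up for partial blocks.
    int runsX = (s.validX + blockDim - 1) / blockDim;
    int runsY = (s.validY + blockDim - 1) / blockDim;
    s.runs = runsX * runsY;

    // Every run offers blockDim * blockDim thread slots, but only the
    // valid responses correspond to useful work.
    s.meanOccupancy = static_cast<double>(s.validX * s.validY) /
                      (static_cast<double>(s.runs) * blockDim * blockDim);
    return s;
}
```

Under these assumptions, tileStats(96, 64, 32, 8) gives 80 × 48 valid responses in six runs at 62.5% mean occupancy with a 17 × 17 maximum filter, while a halo of four gives 88 × 56 valid responses at roughly 80.2% with a 9 × 9 maximum filter, in agreement with the caption.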
direction, since each thread block generates more than one valid filter response
per thread. The calculation of the x- and y-indices inside the kernel also needs
to be changed from the conventional
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
to
int x = blockIdx.x * VALID_RESPONSES_X + threadIdx.x;
int y = blockIdx.y * VALID_RESPONSES_Y + threadIdx.y;
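To illustrate the effect of the changed index calculation, the following sketch reproduces the arithmetic in host-side C++. It assumes that VALID_RESPONSES_X and VALID_RESPONSES_Y are 80 and 48, as in the eight-pixel halo case of Figure 5.5. The function names are hypothetical, and the rounded-up grid size is an assumption about how the number of thread blocks would be chosen, not something stated in the text.

```cpp
#include <cassert>

// Assumed values, taken from the 8-pixel halo case of Figure 5.5.
constexpr int VALID_RESPONSES_X = 80;
constexpr int VALID_RESPONSES_Y = 48;

// Conventional index: consecutive blocks start blockDim pixels apart.
inline int conventionalIndex(int blockIdx, int blockDim, int threadIdx) {
    return blockIdx * blockDim + threadIdx;
}

// Modified index: consecutive blocks start one full set of valid
// responses apart, because each block covers more than blockDim pixels.
inline int tiledIndex(int blockIdx, int validResponses, int threadIdx) {
    return blockIdx * validResponses + threadIdx;
}

// Assumed grid sizing: enough blocks to cover the image, rounding up.
inline int blocksNeeded(int imageSize, int validResponses) {
    assert(imageSize > 0 && validResponses > 0);
    return (imageSize + validResponses - 1) / validResponses;
}
```

For example, thread 0 of block 1 starts at x = 32 with the conventional formula but at x = 80 with the modified one, and a 512-pixel-wide image would need 7 blocks in x instead of 16 under the rounding assumption above.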