have 48 KB of shared memory per MP. If one only considers the number of valid
filter responses generated per thread block, the optimal solution is to use all the
shared memory for a single thread block, since this would waste a minimum of
memory on the “halo” of invalid filter responses at the outer edges. According
to the CUDA programming guide, GPUs with compute capability 3.0 (e.g., the
Nvidia GTX 680) can maximally handle 1024 threads per thread block and 2048
concurrent threads per MP. Using all the shared memory for one thread block would, however, limit the GPU to 50% of its possible computational performance, as only 1024 threads can be used in one thread block. Full occupancy can be achieved by instead dividing the 48 KB of shared memory between two thread blocks. For floating point convolution, 96 × 64 pixel values fit into 24 KB of shared memory. The 1024 threads per thread block are arranged as 32 threads along x and 32 threads along y, to achieve coalesced reads from global memory and to match the number of banks in shared memory (32). Each thread starts by reading six values from global memory into shared memory (96 × 64 / 1024 = 6). For a maximum filter size of 17 × 17 pixels, 80 × 48 valid filter responses can then be calculated from the 96 × 64 values in shared memory, since a halo of size 8 is required on all sides.
All threads start by first calculating two filter responses each, yielding 64 × 32 values. Half of the threads then calculate three additional filter responses each, yielding another 48 × 32 values. Finally, a quarter of the threads are used to calculate the filter responses for the last 16 × 16 pixels. The division of the 80 × 48 values into six blocks is illustrated in Figure 5.5.
part of the code for non-separable 2D convolution using shared memory is given
in Listing 5.2 and the second part is given in Listing 5.3. The device function
that performs the 2D convolution is very similar to the kernel for texture-based
convolution; interested readers are therefore referred to the repository.
If more than one filter is to be applied, e.g., four complex valued quadrature filters oriented along 0, 45, 90, and 135 degrees, all the filter responses can be calculated very efficiently by simply performing several multiplications and additions each time a pixel value has been loaded from shared memory to a register. This results in a better ratio between the number of memory accesses and floating point operations. By reducing the maximum filter size to 9 × 9, the number of valid filter responses increases to 88 × 56, since the halo size shrinks to 4. This will also result in a higher occupancy during the convolution. For the first case, yielding 80 × 48 valid filter responses, the mean occupancy for the six blocks is

(32 · 32 · 2 + 32 · 32 · 2 + 32 · 16 · 2 + 32 · 16 · 2 + 16 · 32 · 2 + 16 · 16 · 2) / (6 · 2048) = 62.5%,

and for the second case, yielding 88 × 56 valid filter responses, the mean occupancy increases to

(32 · 32 · 2 + 32 · 32 · 2 + 32 · 24 · 2 + 32 · 24 · 2 + 24 · 32 · 2 + 24 · 24 · 2) / (6 · 2048) = 80.2%.
For the shared memory implementation, the required number of thread blocks in the x- and y-directions is not calculated by dividing the image width and height by the number of threads in each direction (32). The width and height should instead be divided by the number of valid filter responses generated in each direction.