Because we are changing the filter kernel that each pass executes, it makes sense to
reevaluate the threading setup used in the previous implementation. Instead of using a
square thread group, we flatten the thread groups to match the shape of the processing
kernel used in each pass. Thus, the first pass uses a thread group size of [640,1,1], and
the second pass uses a thread group size of [1,480,1]. Since each thread group then covers
a full row or column of the image, we can use the group shared memory to reduce the
required device memory bandwidth even further.
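To make the dispatch arithmetic concrete, the following host-side C++ sketch computes how many thread groups each pass would need for a 640×480 image. The names `GroupCounts`, `ceilDiv`, and `groupsFor` are hypothetical helpers introduced only for illustration. The resulting counts are the kind of values one would pass to `ID3D11DeviceContext::Dispatch`, assuming the shader's thread group sizes match the ones given above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: thread group counts for one Dispatch call.
struct GroupCounts { std::uint32_t x, y, z; };

// Round up so that partially filled groups still cover the image edge.
constexpr std::uint32_t ceilDiv(std::uint32_t n, std::uint32_t d) {
    return (n + d - 1) / d;
}

// Number of groups needed to cover an imageW x imageH image with
// thread groups of size [groupX, groupY, 1].
GroupCounts groupsFor(std::uint32_t imageW, std::uint32_t imageH,
                      std::uint32_t groupX, std::uint32_t groupY) {
    assert(groupX > 0 && groupY > 0);
    return { ceilDiv(imageW, groupX), ceilDiv(imageH, groupY), 1 };
}
```

For the horizontal pass, `groupsFor(640, 480, 640, 1)` gives 1×480×1 groups, one per image row. For the vertical pass, `groupsFor(640, 480, 1, 480)` gives 640×1×1 groups, one per image column.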
In our previous implementation, each thread read all of the input texture values it
needed to calculate its own output value. For a 7×7 filter, this amounts to 49 individual
values read from the input image per thread. In a separable implementation, each thread
only needs to read the data for a 1×7 or a 7×1 filter in each pass, for a total of
7 + 7 = 14 reads across both passes. This is already a significant reduction in explicit
memory reads, although the effective reduction may be somewhat smaller, because the GPU's
texture data caching already helps the naive implementation. In any case, we can further
reduce the number of reads from the input texture resource by using the GSM. If each thread
reads its own input pixel and stores it in the group shared memory for all the threads in
the thread group to use, the effective number of device memory reads per thread drops from
7 + 7 to 1 + 1! Of course, we add some overhead by writing to and reading from the group
shared memory, but in general this should still be a performance improvement over
performing many more read operations from device memory. After all of the threads have
loaded their data into the GSM, we perform a group memory barrier with thread sync (the
GroupMemoryBarrierWithGroupSync() intrinsic) to ensure that all of the needed data has been
written to the GSM before moving on with the filtering operations. The updated compute
shader program for the horizontal filter is shown in Listing 10.2. The vertical version is
omitted, since it is identical to the version shown except that it samples in a vertical
pattern and declares a different amount of group shared memory.
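The read-count argument above can be checked on the CPU before looking at the shader. The following C++ sketch emulates one row of the horizontal pass in two ways: with every "thread" fetching its seven taps directly from the input, and with every thread first copying its own pixel into a shared buffer. It is an illustration only, not the shader itself. The function names and the read counter are hypothetical, and clamping out-of-range taps to the row edge is an assumption made here to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 7-tap kernel coefficients, copied from the listing in the text.
static const float kFilter[7] = {
    0.030078323f, 0.104983664f, 0.222250419f, 0.285375187f,
    0.222250419f, 0.104983664f, 0.030078323f
};

struct PassResult {
    std::vector<float> output;  // filtered row
    std::size_t deviceReads;    // simulated reads from the input texture
};

// Assumption: taps that fall outside the row are clamped to the nearest edge.
static int clampIndex(int i, int n) {
    return i < 0 ? 0 : (i >= n ? n - 1 : i);
}

// Direct version: each "thread" x reads all seven taps from the input row.
PassResult filterRowDirect(const std::vector<float>& row) {
    assert(!row.empty());
    const int n = static_cast<int>(row.size());
    PassResult r{ std::vector<float>(row.size(), 0.0f), 0 };
    for (int x = 0; x < n; ++x) {
        for (int k = -3; k <= 3; ++k) {
            r.output[x] += kFilter[k + 3] * row[clampIndex(x + k, n)];
            ++r.deviceReads;
        }
    }
    return r;
}

// Shared version: one input read per thread, then filtering from the buffer.
PassResult filterRowShared(const std::vector<float>& row) {
    assert(!row.empty());
    const int n = static_cast<int>(row.size());
    PassResult r{ std::vector<float>(row.size(), 0.0f), 0 };
    std::vector<float> shared(row.size());  // stands in for the groupshared array

    // Phase 1: every thread loads its own pixel exactly once.
    for (int x = 0; x < n; ++x) {
        shared[x] = row[x];
        ++r.deviceReads;
    }
    // Finishing the loop above plays the role of the barrier: no thread
    // filters until all pixels have been written to the shared buffer.

    // Phase 2: filter using only the shared buffer, with no further input reads.
    for (int x = 0; x < n; ++x) {
        for (int k = -3; k <= 3; ++k) {
            r.output[x] += kFilter[k + 3] * shared[clampIndex(x + k, n)];
        }
    }
    return r;
}
```

For a 640-pixel row, the direct version performs 7 × 640 = 4480 simulated input reads, while the shared version performs 640, one per thread, and both produce the same filtered row.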
// Declare the input and output resources
Texture2D<float4> InputMap : register( t0 );
RWTexture2D<float4> OutputMap : register( u0 );
// Image sizes
#define size_x 640
#define size_y 480
// Declare the filter kernel coefficients
static const float filter[7] = {
0.030078323, 0.104983664, 0.222250419, 0.285375187, 0.222250419,
0.104983664, 0.030078323
};
// Declare the group shared memory to hold the loaded data
groupshared float4 horizontalpoints[size_x];