Image Processing - Practical Rendering and Computation with Direct3D 11

Since we are changing the filter kernel that each pass executes, it makes sense to
reevaluate the threading setup used in the previous implementation. Instead of a square
thread group, we flatten the thread groups to match the shape of the processing kernel
used in each pass. Thus, the first pass uses a thread group size of [640,1,1], and the
second pass uses a thread group size of [1,480,1], so that each group covers one full row
or one full column of the image. This allows us to use the group shared memory (GSM) to
reduce the required device memory bandwidth even further.
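As a rough illustration of what these group sizes imply on the host side, the sketch below computes how many thread groups each pass would need. It assumes a 640×480 image, matching the sizes used in the text. The names `Extent3` and `groupsFor` are hypothetical helpers introduced only for this example, and nothing here calls the Direct3D API itself.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper type: three unsigned extents (x, y, z).
struct Extent3 {
    std::uint32_t x, y, z;
};

// Number of thread groups needed so that groups of size `group`
// cover `image` in every dimension (ceiling division).
constexpr Extent3 groupsFor(Extent3 image, Extent3 group) {
    return Extent3{
        (image.x + group.x - 1) / group.x,
        (image.y + group.y - 1) / group.y,
        (image.z + group.z - 1) / group.z,
    };
}

// Assumed image size, matching the 640x480 used in the text.
constexpr Extent3 kImage{640, 480, 1};

// Pass 1: horizontal filter, one thread group per image row.
constexpr Extent3 kHorizontalGroup{640, 1, 1};
// Pass 2: vertical filter, one thread group per image column.
constexpr Extent3 kVerticalGroup{1, 480, 1};

// Direct3D 11 compute shaders (cs_5_0) allow at most 1024 threads per group.
static_assert(kHorizontalGroup.x * kHorizontalGroup.y * kHorizontalGroup.z <= 1024,
              "horizontal group exceeds the per-group thread limit");
static_assert(kVerticalGroup.x * kVerticalGroup.y * kVerticalGroup.z <= 1024,
              "vertical group exceeds the per-group thread limit");

// Group counts that would be passed to a dispatch call for each pass.
constexpr Extent3 kHorizontalDispatch = groupsFor(kImage, kHorizontalGroup);
constexpr Extent3 kVerticalDispatch = groupsFor(kImage, kVerticalGroup);
```

Under these assumptions, the horizontal pass would be dispatched with (1, 480, 1) groups and the vertical pass with (640, 1, 1) groups.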

In our previous implementation, each thread read all of the input texture values it
needed to calculate its own output value. For a 7×7 filter, this amounts to 49 individual
values read from the input image per thread. In a separable implementation, each thread
only needs to read the data for either a 1×7 or a 7×1 filter, for a total of 7 + 7 = 14
reads across the two passes. This is already a significant reduction in explicit memory
reads, although the effective reduction may be somewhat smaller, because the GPU's texture
data caching helps the naive implementation. In any case, we can further reduce the number
of reads from the input texture resource by using the GSM. If each thread reads its own
input pixel data and then stores it in the group shared memory for all the threads in the
thread group to use, the effective number of device memory reads per thread is reduced
from 7 + 7 to 1 + 1! Of course, we add some overhead by writing to and reading from the
group shared memory, but in general this should still be a performance improvement over
performing many more read operations from device memory. After all of the threads have
loaded their data into the GSM, we perform a group memory barrier with thread sync to
ensure that all of the needed data has been written to the GSM before moving on with the
filtering operations. The updated compute shader program for the horizontal filter is shown
in Listing 10.2. The vertical version is omitted, since it is identical to the version
shown except that it samples in a vertical pattern and declares a different amount of
group shared memory.
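The read-count argument above can be checked with a small CPU-side simulation of one image row. This is only a sketch and not the shader itself: the `CountingRow` class, the function names, and the clamp-to-edge handling are assumptions made for illustration, and a single float stands in for each `float4` texel. The point where the first loop of `filterRowShared` ends plays the role of the group memory barrier.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Same 7-tap coefficients as the filter declared in the listing.
constexpr std::array<float, 7> kFilter = {
    0.030078323f, 0.104983664f, 0.222250419f, 0.285375187f,
    0.222250419f, 0.104983664f, 0.030078323f};

// Stand-in for one row of the input texture. Every load is counted so
// that "device memory" traffic can be compared between the two variants.
class CountingRow {
public:
    explicit CountingRow(std::vector<float> texels) : texels_(std::move(texels)) {}
    float load(std::size_t x) const { ++reads_; return texels_[x]; }
    std::size_t size() const { return texels_.size(); }
    std::size_t reads() const { return reads_; }
private:
    std::vector<float> texels_;
    mutable std::size_t reads_ = 0;
};

// Clamp a tap position to the valid range (assumed edge handling).
inline std::size_t clampIndex(long i, std::size_t n) {
    assert(n > 0);
    const long last = static_cast<long>(n) - 1;
    return static_cast<std::size_t>(std::max<long>(0, std::min<long>(i, last)));
}

// Variant A: every "thread" fetches its seven taps directly from the row.
std::vector<float> filterRowDirect(const CountingRow& row) {
    const long n = static_cast<long>(row.size());
    std::vector<float> out(row.size(), 0.0f);
    for (long x = 0; x < n; ++x)
        for (long k = 0; k < 7; ++k)
            out[x] += row.load(clampIndex(x + k - 3, row.size())) * kFilter[k];
    return out;
}

// Variant B: every "thread" first copies its own texel into a shared
// buffer, then all taps are taken from that buffer instead of the row.
std::vector<float> filterRowShared(const CountingRow& row) {
    const long n = static_cast<long>(row.size());
    std::vector<float> shared(row.size());
    for (long x = 0; x < n; ++x)
        shared[x] = row.load(static_cast<std::size_t>(x));  // one read per thread
    // All loads are complete here; on the GPU this is where the barrier sits.
    std::vector<float> out(row.size(), 0.0f);
    for (long x = 0; x < n; ++x)
        for (long k = 0; k < 7; ++k)
            out[x] += shared[clampIndex(x + k - 3, row.size())] * kFilter[k];
    return out;
}
```

For a 640-texel row, the direct variant performs 7 × 640 = 4480 counted reads, while the shared variant performs 640, which is one read per thread for the pass.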

// Declare the input and output resources
Texture2D<float4> InputMap : register( t0 );
RWTexture2D<float4> OutputMap : register( u0 );

// Image sizes
#define size_x 640
#define size_y 480

// Declare the filter kernel coefficients
static const float filter[7] = {
    0.030078323, 0.104983664, 0.222250419, 0.285375187, 0.222250419,
    0.104983664, 0.030078323
};

// Declare the group shared memory to hold the loaded data
groupshared float4 horizontalpoints[size_x];
