Performing Image Processing Techniques - Direct3D Rendering

The box blur filter used in this recipe is separable, meaning that the vertical and horizontal

components of the filter can be applied separately (one after the other) and still achieve

the same result; this generally yields a much faster algorithm. For example, we can take

the separable 3x3-tap box blur kernel shown in the previous figure and split it into two 3-tap

kernels applied in two passes. The 3-tap horizontal kernel and 3-tap vertical kernel can be

applied in any order (although the second filter must use the output of the first as its input).

The final result is that instead of requiring nine samples per texel, the exact same output is

achieved using only six samples per texel; for a 9x9 kernel, the difference is even greater, requiring

only 18 samples per texel instead of 81. For smaller filters, the sample count can also be reduced within a pixel

shader by using bilinear filtering and the Gather texture sampling method.

The coefficients of the resulting 1D filters must be normalized; therefore, instead of each

texel contributing one ninth of the final result in the previous 3x3 box blur filter, each texel

will contribute one third of the final result for the 3-tap horizontal/vertical filter.
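
The equivalence and the normalization described above can be checked on the CPU. The following is a minimal C++ sketch, not the recipe's shader code: the `Image` type, the function names, and the clamp-to-edge border handling are all assumptions made for illustration. It applies a box blur once as a full 2D kernel (each texel weighted one ninth for 3x3) and once as two 1D passes (each texel weighted one third per pass) so the outputs can be compared.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-channel image used only for this sketch.
struct Image {
    int width;
    int height;
    std::vector<float> texels;  // row-major, width * height values

    // Clamp-to-edge addressing (an assumption; the text does not say how borders are handled).
    float at(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return texels[static_cast<std::size_t>(y) * width + x];
    }
};

// Samples per texel for an N x N kernel applied directly, and as two N-tap passes.
constexpr int samplesDirect(int taps) { return taps * taps; }
constexpr int samplesSeparable(int taps) { return 2 * taps; }

// Full 2D box blur: every texel in the window is weighted 1 / (taps * taps).
Image boxBlur2D(const Image& src, int radius) {
    assert(radius >= 0 && src.width > 0 && src.height > 0);
    const int taps = 2 * radius + 1;
    const float weight = 1.0f / static_cast<float>(taps * taps);
    Image dst{src.width, src.height, std::vector<float>(src.texels.size())};
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            float sum = 0.0f;
            for (int dy = -radius; dy <= radius; ++dy)
                for (int dx = -radius; dx <= radius; ++dx)
                    sum += src.at(x + dx, y + dy);
            dst.texels[static_cast<std::size_t>(y) * src.width + x] = sum * weight;
        }
    }
    return dst;
}

// One 1D pass: every texel in the row (or column) window is weighted 1 / taps.
Image boxBlur1D(const Image& src, int radius, bool horizontal) {
    assert(radius >= 0 && src.width > 0 && src.height > 0);
    const float weight = 1.0f / static_cast<float>(2 * radius + 1);
    Image dst{src.width, src.height, std::vector<float>(src.texels.size())};
    for (int y = 0; y < src.height; ++y) {
        for (int x = 0; x < src.width; ++x) {
            float sum = 0.0f;
            for (int d = -radius; d <= radius; ++d)
                sum += horizontal ? src.at(x + d, y) : src.at(x, y + d);
            dst.texels[static_cast<std::size_t>(y) * src.width + x] = sum * weight;
        }
    }
    return dst;
}

// Largest per-texel absolute difference between two images of equal size.
float maxAbsDifference(const Image& a, const Image& b) {
    assert(a.texels.size() == b.texels.size());
    float worst = 0.0f;
    for (std::size_t i = 0; i < a.texels.size(); ++i) {
        const float d = a.texels[i] - b.texels[i];
        worst = std::max(worst, std::max(d, -d));
    }
    return worst;
}
```

Calling `boxBlur1D(boxBlur1D(img, 1, true), 1, false)` and comparing the result with `boxBlur2D(img, 1)` through `maxAbsDifference` should give a difference within floating-point rounding, and swapping the order of the two passes should not change that.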

To further reduce the number of samples required, we have taken advantage of the compute

shader's thread group shared memory. The actual number of samples is now close to two

samples per texel instead of 6.6. Of course, texture sampling isn't the only overhead;

although the 32 KB of shared memory sits close to the SIMD units on the hardware,

accessing it still incurs a cost. The shader works by first loading one texel into group-shared memory for

each thread within the group. After all threads in the current thread group have loaded their

texel, each thread then applies the blur kernel to its texel, accessing the cached values of

neighboring texels from the group-shared memory. The following code snippet highlights

the process:

// 1. Sample the texel for current thread and place in group
// shared memory
...
// 2. Wait for all threads in group to complete sampling
GroupMemoryBarrierWithGroupSync();
// 3. Apply kernel to current texel, reading neighboring texels
// from group shared memory. Write result to output UAV
...
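
To see why caching cuts the sample count, the three steps can be emulated on the CPU for a single row. The following is a C++ sketch under stated assumptions, not the recipe's HLSL: `blurRowGroup`, `GroupResult`, the small `THREADSX` value, and the clamp-to-edge addressing are hypothetical, and a plain `std::vector` stands in for group-shared memory. It also loads the extra edge texels that the text goes on to describe, and it counts how many reads hit the source row.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical values for this sketch; the recipe's actual defines may differ.
constexpr int THREADSX = 8;  // threads per group along the row
constexpr int FILTERTAP = 5;
constexpr int FILTERRADIUS = (FILTERTAP - 1) / 2;
static_assert(THREADSX >= FILTERRADIUS, "edge threads must be able to fill every extra slot");

struct GroupResult {
    std::vector<float> output;  // one blurred value per thread
    int textureFetches;         // reads from the source row (stand-in for texture samples)
};

// Emulates one thread group blurring THREADSX texels of a row, starting at groupStart.
GroupResult blurRowGroup(const std::vector<float>& row, int groupStart) {
    assert(!row.empty());
    const int last = static_cast<int>(row.size()) - 1;
    int fetches = 0;
    // Clamp-to-edge read that also counts the fetch (border handling is an assumption).
    auto fetch = [&](int x) {
        ++fetches;
        return row[std::min(std::max(x, 0), last)];
    };

    // Stand-in for group-shared memory: group size plus FILTERRADIUS * 2 edge texels.
    std::vector<float> cache(THREADSX + 2 * FILTERRADIUS);

    // Step 1: each "thread" loads its own texel; threads near the edge load one extra.
    for (int t = 0; t < THREADSX; ++t) {
        const int x = groupStart + t;
        cache[t + FILTERRADIUS] = fetch(x);
        if (t < FILTERRADIUS)
            cache[t] = fetch(x - FILTERRADIUS);
        if (t >= THREADSX - FILTERRADIUS)
            cache[t + 2 * FILTERRADIUS] = fetch(x + FILTERRADIUS);
    }

    // Step 2: on the GPU, GroupMemoryBarrierWithGroupSync() goes here.
    // In this serial emulation, finishing the loop above plays that role.

    // Step 3: each "thread" applies the kernel using only cached values.
    GroupResult result{std::vector<float>(THREADSX), 0};
    for (int t = 0; t < THREADSX; ++t) {
        float sum = 0.0f;
        for (int k = -FILTERRADIUS; k <= FILTERRADIUS; ++k)
            sum += cache[t + FILTERRADIUS + k];
        result.output[t] = sum / static_cast<float>(FILTERTAP);
    }
    result.textureFetches = fetches;
    return result;
}
```

With these values, a group performs `THREADSX + 2 * FILTERRADIUS` fetches (12 for 8 texels) rather than `THREADSX * FILTERTAP` (40). As the group grows, the per-pass cost approaches one fetch per texel, which is consistent with the figure of roughly two samples per texel across the horizontal and vertical passes.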

To deal with the threads at the edge of the thread group, we need to load an additional

FILTERRADIUS*2 texels (that is, for a 5-tap blur with a FILTERRADIUS of 2, we need to load an additional texel

for the first two and the last two threads on a row for the horizontal blur, or a column

for the vertical blur). Our group-shared memory is sized to fit the thread group's size

plus an additional (THREADSY*FILTERRADIUS*2) texels for the horizontal shader,

and (THREADSX*FILTERRADIUS*2) texels for the vertical shader (see the outside

elements in the following group-shared memory layout diagram):
