The box blur filter used in this recipe is separable, meaning that the vertical and horizontal
components of the filter can be applied separately (one after the other) and still achieve
the same result; this generally yields a much faster algorithm. For example, we can take
the separable 3x3-tap box blur kernel shown in the previous figure and split it into two 3-tap
kernels applied in two passes. The 3-tap horizontal kernel and 3-tap vertical kernel can be
applied in either order (although the second pass must use the output of the first as its input).
The final result is that instead of requiring nine samples per texel, the exact same output is
achieved using only six samples per texel; for a 9x9 kernel the difference is even greater, requiring
only 18 samples per texel instead of 81. For smaller filters, the sample count can also be reduced
within a pixel shader by using bilinear filtering and the Gather texture sampling instruction.
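The separability claim can be checked on the CPU. The following is a minimal sketch in Python rather than HLSL, assuming clamp-to-edge addressing at the image borders; the function names (`box_blur_2d`, `box_blur_1d`, `samples_per_texel`) are illustrative and are not part of the recipe's code. Exact fractions are used so that the two results can be compared for equality.

```python
from fractions import Fraction

def clamp(i, lo, hi):
    return max(lo, min(i, hi))

def box_blur_2d(img, radius):
    """Single pass: (2*radius+1)^2 samples per texel, clamped at the borders."""
    h, w = len(img), len(img[0])
    taps = 2 * radius + 1
    out = [[Fraction(0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = Fraction(0)
            for dy in range(-radius, radius + 1):
                for dx in range(-radius, radius + 1):
                    total += img[clamp(y + dy, 0, h - 1)][clamp(x + dx, 0, w - 1)]
            out[y][x] = total / (taps * taps)
    return out

def box_blur_1d(img, radius, horizontal):
    """One 1D pass: 2*radius+1 samples per texel along a row or a column."""
    h, w = len(img), len(img[0])
    taps = 2 * radius + 1
    out = [[Fraction(0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            total = Fraction(0)
            for d in range(-radius, radius + 1):
                if horizontal:
                    total += img[y][clamp(x + d, 0, w - 1)]
                else:
                    total += img[clamp(y + d, 0, h - 1)][x]
            out[y][x] = total / taps
    return out

def samples_per_texel(taps):
    """Return (single-pass samples, two-pass separable samples) per texel."""
    return taps * taps, 2 * taps

# Hypothetical 4x5 test image.
image = [[(3 * x + 7 * y * y) % 11 for x in range(5)] for y in range(4)]

single_pass = box_blur_2d(image, 1)
horizontal_then_vertical = box_blur_1d(box_blur_1d(image, 1, True), 1, False)
vertical_then_horizontal = box_blur_1d(box_blur_1d(image, 1, False), 1, True)
```

Both pass orders reproduce the single-pass result here, and `samples_per_texel(3)` and `samples_per_texel(9)` give the 9 versus 6 and 81 versus 18 figures quoted above.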
The coefficients of the resulting 1D filters must be normalized; therefore, instead of each
texel contributing one ninth of the final result in the previous 3x3 box blur filter, each texel
will contribute one third of the final result for the 3-tap horizontal/vertical filter.
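As a quick check of the normalization, the sketch below (Python, with illustrative helper names that do not appear in the recipe) builds the 3-tap kernel with weights of one third and shows that applying it along both axes is equivalent to a 3x3 kernel whose weights are all one ninth.

```python
from fractions import Fraction

def box_kernel_1d(taps):
    """Normalized 1D box kernel: every tap has weight 1/taps."""
    return [Fraction(1, taps)] * taps

def outer(column, row):
    """Outer product of two 1D kernels, giving the equivalent 2D kernel."""
    return [[c * r for r in row] for c in column]

kernel_1d = box_kernel_1d(3)
kernel_2d = outer(kernel_1d, kernel_1d)
```

Each 1D kernel sums to one, so the combined 2D kernel also sums to one and the blur does not brighten or darken the image.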
To further reduce the number of samples required, we have taken advantage of the compute
shader's thread group shared memory. The actual number of texture samples is now close to two
samples per texel instead of six. Of course, texture sampling isn't the only overhead;
although the 32 KB of shared memory sits close to each of the SIMD units on the hardware,
accessing it still incurs a cost. We reduce the sample count by first loading one texel into the
group-shared memory for each thread within the group. After all threads in the current thread
group have loaded their texel, each thread applies the blur kernel to its texel, reading the
cached values of neighboring texels from the group-shared memory. The following code snippet
highlights the process:
// 1. Sample the texel for current thread and place in group
// shared memory
...
// 2. Wait for all threads in group to complete sampling
GroupMemoryBarrierWithGroupSync();
// 3. Apply kernel to current texel, reading neighboring texels
// from group shared memory. Write result to output UAV
...
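The three steps can be mimicked on the CPU to see where the saving comes from. This is a simplified Python sketch, not the recipe's shader: it assumes a single thread group with one thread per texel that spans an entire row, so neighbor indices are simply clamped to the row. The function name and the fetch counter are illustrative.

```python
def blur_row_with_shared_memory(row, radius):
    """Simulate one thread group (one thread per texel) blurring a single row."""
    n = len(row)
    taps = 2 * radius + 1
    texture_fetches = 0

    # 1. Every thread samples its own texel into group-shared memory.
    shared = [0.0] * n
    for thread_id in range(n):
        shared[thread_id] = row[thread_id]
        texture_fetches += 1

    # 2. Barrier: GroupMemoryBarrierWithGroupSync() in the shader. In this
    #    sequential simulation the loop above has already completed.

    # 3. Every thread applies the kernel, reading neighbors from shared memory.
    output = []
    for thread_id in range(n):
        total = 0
        for offset in range(-radius, radius + 1):
            total += shared[min(max(thread_id + offset, 0), n - 1)]
        output.append(total / taps)
    return output, texture_fetches

blurred, fetches = blur_row_with_shared_memory([3, 6, 9, 12], 1)
```

In this sketch the row of four texels costs four texture fetches, whereas sampling every tap directly would cost twelve; the remaining reads are served from the shared array.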
To deal with the threads at the edge of the thread group, we need to load an additional
FILTERRADIUS*2 texels per row or column (that is, for a 5-tap blur, where FILTERRADIUS is 2,
we need to load an additional texel for each of the first two and the last two threads on a row
for the horizontal blur, or on a column for the vertical blur). Our group-shared memory is sized
to fit the thread group plus an additional (THREADSY*FILTERRADIUS*2) texels for the horizontal
shader, and (THREADSX*FILTERRADIUS*2) texels for the vertical shader (see the outside
elements in the following group-shared memory layout diagram):
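The sizing rule and the edge loads can also be checked numerically. The Python sketch below is a CPU-side approximation under stated assumptions: the 32x32 group size is hypothetical (the actual THREADSX and THREADSY values are not given here), FILTERRADIUS is taken as (taps - 1) / 2, borders use clamp-to-edge addressing, and THREADSX is assumed to be at least FILTERRADIUS. All function names are illustrative.

```python
def shared_memory_texels(threads_x, threads_y, filter_radius, horizontal):
    """Group size plus the extra edge texels described in the text."""
    extra_lines = threads_y if horizontal else threads_x
    return threads_x * threads_y + extra_lines * filter_radius * 2

def samples_per_texel(threads_x, threads_y, filter_radius):
    """Texture samples per output texel over both passes."""
    total = (shared_memory_texels(threads_x, threads_y, filter_radius, True)
             + shared_memory_texels(threads_x, threads_y, filter_radius, False))
    return total / (threads_x * threads_y)

def blur_row_direct(row, radius):
    """Reference: every tap is a clamped texture fetch."""
    n, taps = len(row), 2 * radius + 1
    return [sum(row[min(max(x + d, 0), n - 1)] for d in range(-radius, radius + 1)) / taps
            for x in range(n)]

def blur_row_in_groups(row, threads_x, radius):
    """Horizontal pass over one row, split into groups of threads_x threads."""
    n, taps = len(row), 2 * radius + 1
    out = [0.0] * n
    fetches = 0

    def sample(x):  # clamp-to-edge texture fetch, counted
        nonlocal fetches
        fetches += 1
        return row[min(max(x, 0), n - 1)]

    for group_start in range(0, n, threads_x):
        shared = [0.0] * (threads_x + 2 * radius)
        # Load phase: one texel per thread, plus one extra texel for each of
        # the first `radius` and last `radius` threads in the group.
        for tid in range(threads_x):
            x = group_start + tid
            shared[tid + radius] = sample(x)
            if tid < radius:
                shared[tid] = sample(x - radius)
            if tid >= threads_x - radius:
                shared[tid + 2 * radius] = sample(x + radius)
        # (barrier here in the shader)
        # Apply phase: each thread reads its taps from shared memory only.
        for tid in range(threads_x):
            x = group_start + tid
            if x < n:
                out[x] = sum(shared[tid:tid + taps]) / taps
    return out, fetches

row = [0, 5, 10, 15, 20, 25, 30, 35]
grouped, fetches = blur_row_in_groups(row, threads_x=4, radius=2)
```

With the hypothetical 32x32 group and a 5-tap filter, each pass needs 1,152 texels of shared memory and the two passes together take 2.25 texture samples per texel, in line with the "close to two" figure. Larger groups amortize the edge texels further.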
 