We could also manually take multiple samples to achieve the same result.
Basically, instead of sampling quads, we sample elongated rectangles at grazing
angles.
We saw earlier in Section 4.4.5 that for complicated BRDF models we would
need to pre-compute a 2D table of local reflection vectors and cone angles. A
texture format suited for this is R16G16B16A16: the RGB channels would store the
local reflection vector, and the alpha channel would store either a single isotropic
cone-angle extent or two anisotropic vertical and horizontal cone-angle extents.
These two anisotropic values would determine how many extra samples we take
vertically to approximate an elongated rectangle and stretch the reflections.
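As a rough illustration of the last step, the extra vertical sample count can be derived from the ratio of the two cone extents. The sketch below is an assumption, not the chapter's code; the function name, the clamping, and the ceiling heuristic are all illustrative choices.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical sketch: derive the number of extra vertical samples from the
// anisotropic vertical/horizontal cone-angle extents read out of the
// precomputed table. The more elongated the cone (coneV >> coneH), the more
// vertical samples we take to stretch the reflection; maxSamples caps the cost.
int ExtraVerticalSamples(float coneV, float coneH, int maxSamples)
{
    if (coneH <= 0.0f)
        return 0;

    // Elongation of the footprint: 1.0 means isotropic, no extra samples.
    float elongation = coneV / coneH;
    int samples = static_cast<int>(std::ceil(elongation)) - 1;
    return std::clamp(samples, 0, maxSamples);
}
```

For an isotropic cone the function returns zero and we fall back to the single-sample path; only strongly grazing, elongated footprints pay for the additional fetches.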
4.7 Optimizations
4.7.1 Combining Linear and Hi-Z Traversal
One drawback of the Hierarchical-Z traversal is that it descends to the lower
hierarchy levels whenever the ray travels close to a surface. Evaluating the
entire Hierarchical-Z traversal algorithm for such small steps is more expensive
than a simple linear search with the same step size. Unfortunately, the ray
starts out immediately close to a surface: the very surface we are reflecting the
original ray from. Taking a few linear-search steps at the beginning is therefore
a great optimization; it moves the ray away from the surface and then lets the
Hierarchical-Z traversal algorithm do its job of taking the big steps.
If the linear search finds an intersection, we can early-out in the shader
code with a dynamic branch and skip the entire Hi-Z traversal phase. It is also
worthwhile to end the Hi-Z traversal at a much earlier level, such as 1 or 2, and
then finish with another linear search. The ending level can be calculated from
the distance to the camera: the farther away the pixel is, the less detail it needs
because of perspective, so stopping much earlier is going to give a boost in
performance.
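A minimal sketch of the distance-based stopping level might look like the following. This is an assumption about how one could map view distance to a coarser Hi-Z end level; the function name, the [near, far] normalization, and the maximum level of 2 are illustrative, not the chapter's implementation.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical sketch: pick the Hi-Z level at which to stop the hierarchical
// traversal and hand over to a final linear search. Farther pixels need less
// detail under perspective, so they stop at a coarser (higher) level.
//
// Overall flow this plugs into (per the text):
//   1. a few linear steps to push the ray off its originating surface,
//   2. early-out if that linear search already intersects,
//   3. Hi-Z traversal down to HiZStopLevel(...),
//   4. a short linear search to finish.
int HiZStopLevel(float viewDistance, float nearDist, float farDist)
{
    // Normalize the distance into [0, 1] over the chosen range.
    float t = std::clamp((viewDistance - nearDist) / (farDist - nearDist),
                         0.0f, 1.0f);

    // Blend from the finest stop level (0) up to level 2 with distance.
    return static_cast<int>(std::floor(t * 2.0f + 0.5f));
}
```

Nearby pixels then traverse all the way down to level 0, while distant pixels stop at level 2 and let the cheap final linear search resolve the hit.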
4.7.2 Improving Fetch Latency
Partially unrolling dynamic loops that issue dependent texture fetches tends to
improve performance in fetch/latency-bound algorithms. Instead of issuing one
fetch per loop iteration, we pre-fetch the work for the next N iterations. We can
do this because the ray follows a deterministic path. However, there is a point
where pre-fetching starts to hurt performance: register usage rises, and using
more registers means fewer groups of threads can run in parallel. A good starting
point is N = 4. With that value, a speedup of 2×-3× was measured on a regular
linear tracing algorithm on both NVIDIA and AMD hardware. The numbers
appearing later in this chapter do not include these improvements because they
were not tested on a Hi-Z tracer.
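The shape of such a partially unrolled loop can be sketched on the CPU as follows. This is an assumed simplification: a linear tracer over a 1D depth buffer with an unroll factor of N = 4, where all four samples are fetched before any comparison consumes them. On a GPU this lets the dependent fetches be issued back to back, hiding their latency; on a CPU the code only illustrates the structure.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of a partially unrolled linear trace. Returns the index
// of the first step whose stored depth lies in front of (<=) rayDepth, or -1
// if the ray never intersects.
int TraceLinearUnrolled(const std::vector<float>& depth, float rayDepth)
{
    const std::size_t N = 4; // unroll / pre-fetch factor
    std::size_t i = 0;

    while (i + N <= depth.size())
    {
        // Issue all N fetches up front; the dependent comparisons follow.
        float d0 = depth[i + 0], d1 = depth[i + 1];
        float d2 = depth[i + 2], d3 = depth[i + 3];

        if (d0 <= rayDepth) return static_cast<int>(i + 0);
        if (d1 <= rayDepth) return static_cast<int>(i + 1);
        if (d2 <= rayDepth) return static_cast<int>(i + 2);
        if (d3 <= rayDepth) return static_cast<int>(i + 3);

        i += N;
    }

    // Tail loop for the remaining (fewer than N) steps.
    for (; i < depth.size(); ++i)
        if (depth[i] <= rayDepth) return static_cast<int>(i);

    return -1;
}
```

Raising N past this point trades latency hiding for register pressure, which is exactly the occupancy cost described above.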