bandwidth because the entire scene will be processed for each pixel. If we use a
more sophisticated data structure, then we likely will reduce bandwidth but also
reduce memory coherence. Furthermore, adjacent pixels likely sample the same
triangle, but by the time we have iterated through to testing that triangle again it
is likely to have been flushed from the cache. A popular low-level optimization
for a ray tracer is to trace a bundle of rays called a ray packet through adjacent
pixels. These rays likely traverse the scene data structure in a similar way, which
increases memory coherence. On a SIMD processor a single thread can trace an
entire packet simultaneously. However, packet tracing suffers from computational
coherence problems. Sometimes different rays in the same packet progress to dif-
ferent parts of the scene data structure or branch different ways in the ray intersec-
tion test. In these cases, processing multiple rays simultaneously on a thread gives
no advantage because memory coherence is lost or both sides of the branch must
be taken. As a result, fast ray tracers are often designed to trace packets through
very sophisticated data structures. They are typically limited not by computation
but by memory performance problems arising from resultant cache inefficiency.
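As a rough sketch of how a ray packet might look in code (the names RayPacket, Aabb, and intersectPacket are illustrative assumptions, not any particular renderer's interface), the fragment below stores a small packet in structure-of-arrays form and tests every ray against one bounding box of the scene data structure, so that all lanes share a single memory access pattern:

#include <array>
#include <algorithm>
#include <utility>
#include <cstdint>

// A fixed-size packet of rays in structure-of-arrays layout, so that
// adjacent lanes correspond to adjacent pixels and can be processed
// together by SIMD hardware.
constexpr int PacketSize = 8;

struct RayPacket {
    std::array<float, PacketSize> ox, oy, oz;          // ray origins
    std::array<float, PacketSize> invDx, invDy, invDz; // 1 / direction, precomputed
};

struct Aabb { float lo[3]; float hi[3]; };

// Slab test of every ray in the packet against one bounding box.
// Returns a bitmask of the rays that hit, so traversal descends into a
// node whenever any lane is active.  Each lane performs the same sequence
// of operations, which is what makes the loop amenable to SIMD execution.
uint32_t intersectPacket(const RayPacket& p, const Aabb& box, float tMax) {
    uint32_t hitMask = 0;
    for (int i = 0; i < PacketSize; ++i) {
        float t0 = 0.0f, t1 = tMax;
        const float o[3]   = { p.ox[i], p.oy[i], p.oz[i] };
        const float inv[3] = { p.invDx[i], p.invDy[i], p.invDz[i] };
        for (int a = 0; a < 3; ++a) {
            float tNear = (box.lo[a] - o[a]) * inv[a];
            float tFar  = (box.hi[a] - o[a]) * inv[a];
            if (tNear > tFar) std::swap(tNear, tFar);
            t0 = std::max(t0, tNear);
            t1 = std::min(t1, tFar);
        }
        if (t0 <= t1) hitMask |= (1u << i);
    }
    return hitMask;
}

When the mask comes back zero, the whole packet skips a subtree together; when the rays disagree, both subtrees must still be visited, which is exactly the loss of coherence described above.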
Because frame buffer storage per pixel is often much smaller than scene struc-
ture per triangle, the rasterizer has an inherent memory performance advantage
over the ray tracer. A rasterizer reads each triangle into memory and then pro-
cesses it to completion, iterating over many pixels. Those pixels must be adjacent
to each other in space. For a row-major image, if we iterate along rows, then
the pixels covered by the triangle are also adjacent in memory and we will have
excellent coherence and fairly low memory bandwidth in the inner loop. Further-
more, we can process multiple adjacent pixels, either horizontally or vertically,
simultaneously on a SIMD architecture. These will be highly memory and branch
coherent because we're stepping along a single triangle (a sketch of such an inner loop appears below).

There are many variations on ray casting and rasterization that improve their asymptotic behavior. However,
these algorithms have historically been applied to only millions of triangles and
pixels. At those sizes, constant factors like coherence still drive the performance
of the algorithms, and rasterization's superior coherence properties have made it
preferred for high-performance rendering. The cost of this coherence is that after
even the few optimizations needed to get real-time performance from a raster-
izer, the code becomes so littered with bit-manipulation tricks and highly derived
terms that the elegance of a simple ray cast seems very attractive from a software
engineering perspective. This difference is only magnified when we make the ren-
dering algorithm more sophisticated. The conventional wisdom is that ray-tracing
algorithms are elegant and easy to extend but are hard to optimize, and rasteri-
zation algorithms are very efficient but are awkward and hard to augment with
new features. Of course, one can always make a ray tracer fast and ugly (which
packet tracing succeeds at admirably) and a rasterizer extensible but slow (e.g.,
Pixar's RenderMan, which was used extensively in film rendering over the past
two decades).
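To make the rasterizer's inner loop concrete, here is a minimal sketch (the edge-function formulation and the names Framebuffer, edgeFunction, and rasterizeTriangle are assumptions for illustration, not this text's own code) that scans a triangle's screen-space bounding box in row-major order, so the pixels it touches on each row are contiguous in memory:

#include <algorithm>
#include <cmath>
#include <vector>

struct Vec2 { float x, y; };

// Signed-area edge function: its sign tells which side of edge a->b the
// point c lies on, and it is zero on the edge itself.
inline float edgeFunction(const Vec2& a, const Vec2& b, const Vec2& c) {
    return (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
}

// Row-major frame buffer: pixel (x, y) lives at index y * width + x, so
// stepping x along a row touches consecutive memory locations.
struct Framebuffer {
    int width, height;
    std::vector<float> color;  // single channel, for brevity
    Framebuffer(int w, int h) : width(w), height(h), color(w * h, 0.0f) {}
};

// Iterate over the triangle's bounding box, rows in the outer loop and
// columns in the inner loop, writing a constant shade wherever the pixel
// center is covered.  Because the image is row major, the inner loop
// reads and writes adjacent addresses, which is the coherence advantage
// described above.
void rasterizeTriangle(Framebuffer& fb, Vec2 v0, Vec2 v1, Vec2 v2, float shade) {
    int x0 = std::max(0, (int)std::floor(std::min({v0.x, v1.x, v2.x})));
    int x1 = std::min(fb.width  - 1, (int)std::ceil(std::max({v0.x, v1.x, v2.x})));
    int y0 = std::max(0, (int)std::floor(std::min({v0.y, v1.y, v2.y})));
    int y1 = std::min(fb.height - 1, (int)std::ceil(std::max({v0.y, v1.y, v2.y})));

    for (int y = y0; y <= y1; ++y) {
        for (int x = x0; x <= x1; ++x) {
            Vec2 p{ x + 0.5f, y + 0.5f };
            float w0 = edgeFunction(v1, v2, p);
            float w1 = edgeFunction(v2, v0, p);
            float w2 = edgeFunction(v0, v1, p);
            // The three edge functions share a sign when p is inside,
            // regardless of the triangle's winding order.
            if ((w0 >= 0 && w1 >= 0 && w2 >= 0) || (w0 <= 0 && w1 <= 0 && w2 <= 0)) {
                fb.color[y * fb.width + x] = shade;
            }
        }
    }
}

A production rasterizer replaces the per-pixel edge evaluations with incremental updates and steps over several adjacent pixels per SIMD instruction, which is where the bit-manipulation tricks mentioned above begin to accumulate.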
15.8.3 Early-Depth-Test Example
One simple optimization that can significantly improve performance, yet only
minimally affects clarity, is an early depth test. Both the rasterizer and the ray
tracer as structured above sometimes shaded a point, only to find later that some
other point was closer to the viewer. As an optimization, we might first find the
closest point before doing any shading, and then go back and shade the point that
was closest. In ray casting, this means finding the closest intersection along each
ray before invoking the shading computation.
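A minimal sketch of that restructuring in a ray caster follows; it uses spheres rather than triangles purely to keep the example self-contained, and the names Vec3, Ray, Sphere, and shadeClosest are placeholders rather than this text's interfaces:

#include <cmath>
#include <limits>
#include <vector>

struct Vec3   { float x, y, z; };
struct Ray    { Vec3 origin, dir; };            // dir assumed normalized
struct Sphere { Vec3 center; float radius; float shade; };

static float dot(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
static Vec3  sub(Vec3 a, Vec3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }

// Smallest positive ray parameter t at which the ray hits the sphere,
// or a negative value if it misses.
static float intersect(const Ray& r, const Sphere& s) {
    Vec3 oc = sub(r.origin, s.center);
    float b = dot(oc, r.dir);
    float c = dot(oc, oc) - s.radius * s.radius;
    float disc = b * b - c;
    if (disc < 0.0f) return -1.0f;
    float t = -b - std::sqrt(disc);
    return (t > 0.0f) ? t : -1.0f;
}

// Early-depth-test structure: find the closest hit over the whole scene
// first, then "shade" exactly once (here shading is just the sphere's
// stored value, standing in for a full shading computation), instead of
// shading every hit and overwriting it when a nearer one turns up.
static float shadeClosest(const Ray& ray, const std::vector<Sphere>& scene) {
    float closestT = std::numeric_limits<float>::infinity();
    const Sphere* closest = nullptr;
    for (const Sphere& s : scene) {
        float t = intersect(ray, s);
        if (t > 0.0f && t < closestT) { closestT = t; closest = &s; }
    }
    return closest ? closest->shade : 0.0f;  // 0 represents the background
}

The same restructuring applies to the rasterizer's per-pixel loop: resolve the depth comparison first and run the shading computation only for the sample that survives it.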
 
 