During ray marching, enqueued rays are output in an arbitrary order, making
use of structured unordered access views enhanced with a global atomic counter
(D3D11_BUFFER_UAV_FLAG_COUNTER creation flag). A slot in the buffer is reserved
by calling RWStructuredBuffer.IncrementCounter(). The function yields an index
where the enqueued ray may be stored. Note that this is much more efficient than
using InterlockedAdd() on a counter value stored in a global memory buffer, as
IncrementCounter() is optimized for highly concurrent access.
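As a minimal HLSL sketch of this enqueueing step (the buffer and struct names here are illustrative, not taken from the actual implementation):

    struct Ray
    {
        float3 origin;
        float3 direction;
    };

    // UAV created with the D3D11_BUFFER_UAV_FLAG_COUNTER flag.
    RWStructuredBuffer<Ray> g_rayQueue : register(u0);

    uint EnqueueRay(Ray ray)
    {
        // Reserve a unique slot via the buffer's hidden hardware counter.
        uint slot = g_rayQueue.IncrementCounter();
        g_rayQueue[slot] = ray;
        return slot;
    }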
For each occupied grid cell crossed during ray marching, we output the ray
index annotated with the Morton code of the crossed cell. The Morton code is
constructed by interleaving the bits of the three integers forming the grid cell
index. Sorting data by Morton code yields the desired coherent Z-order memory
layout described in Section 2.4.2.
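The interleaving can be implemented with the classic bit-spreading sequence. The following HLSL sketch (function names are ours) assumes, for illustration, a grid of up to 1024^3 cells, so each coordinate fits into 10 bits:

    // Spread a 10-bit integer so its bits occupy every third position.
    uint Part1By2(uint x)
    {
        x &= 0x000003ff;                  // keep the low 10 bits
        x = (x ^ (x << 16)) & 0xff0000ff;
        x = (x ^ (x <<  8)) & 0x0300f00f;
        x = (x ^ (x <<  4)) & 0x030c30c3;
        x = (x ^ (x <<  2)) & 0x09249249;
        return x;
    }

    uint MortonCode3D(uint3 cell)
    {
        return (Part1By2(cell.z) << 2)
             | (Part1By2(cell.y) << 1)
             |  Part1By2(cell.x);
    }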
After sorting, each grid cell's range of ray links (ray indices with cell codes) is
extracted in another compute shader pass. This pass simply compares the Morton
codes of successive ray links. Whenever these differ, the end of the previous cell's
range and the beginning of the next cell's range have been found. Decoding the
Morton codes yields the grid indices of each cell, which allows the begin and
end ray link indices to be stored in the respective cells of a volume texture.
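A sketch of this extraction pass, again with illustrative HLSL names; DecodeMorton3D inverts the interleaving shown earlier:

    struct RayLink { uint rayIndex; uint mortonCode; };

    StructuredBuffer<RayLink> g_sortedLinks : register(t0);
    RWTexture3D<uint> g_cellBegin : register(u0);
    RWTexture3D<uint> g_cellEnd   : register(u1);

    cbuffer Constants : register(b0) { uint g_linkCount; };

    // Inverse of Part1By2: gather every third bit into the low 10 bits.
    uint Compact1By2(uint x)
    {
        x &= 0x09249249;
        x = (x ^ (x >>  2)) & 0x030c30c3;
        x = (x ^ (x >>  4)) & 0x0300f00f;
        x = (x ^ (x >>  8)) & 0xff0000ff;
        x = (x ^ (x >> 16)) & 0x000003ff;
        return x;
    }

    uint3 DecodeMorton3D(uint code)
    {
        return uint3(Compact1By2(code),
                     Compact1By2(code >> 1),
                     Compact1By2(code >> 2));
    }

    [numthreads(256, 1, 1)]
    void ExtractRanges(uint3 dtid : SV_DispatchThreadID)
    {
        uint i = dtid.x;
        if (i >= g_linkCount) return;

        uint code = g_sortedLinks[i].mortonCode;
        // A range begins where the code differs from its predecessor ...
        if (i == 0 || g_sortedLinks[i - 1].mortonCode != code)
            g_cellBegin[DecodeMorton3D(code)] = i;
        // ... and ends (exclusively) where it differs from its successor.
        if (i + 1 == g_linkCount || g_sortedLinks[i + 1].mortonCode != code)
            g_cellEnd[DecodeMorton3D(code)] = i + 1;
    }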
Compacted ray inlining. During intersection testing, all rays stored in a ray grid
cell need to be fetched and tested against each overlapping primitive. Following
the ray indices stored as ray links would cause repeated random indirect
memory accesses. To further increase memory access coherency, we therefore
compact and inline enqueued rays after sorting. Instead of following ray links
again and again for each primitive, we follow the ray links in every cell once up
front and store compacted clones of the referenced rays in an array parallel to
the ray link array.
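This compaction amounts to a single indirect fetch per ray link up front, for example (illustrative HLSL; the actual implementation stores the packed ray format described below):

    struct Ray     { float3 origin; float3 direction; };
    struct RayLink { uint rayIndex; uint mortonCode; };

    StructuredBuffer<RayLink> g_sortedLinks : register(t0);
    StructuredBuffer<Ray>     g_rays        : register(t1);
    // Stored parallel to g_sortedLinks.
    RWStructuredBuffer<Ray>   g_inlinedRays : register(u0);

    cbuffer Constants : register(b0) { uint g_linkCount; };

    [numthreads(256, 1, 1)]
    void InlineRays(uint3 dtid : SV_DispatchThreadID)
    {
        uint i = dtid.x;
        if (i >= g_linkCount) return;
        // One random indirect access here replaces one per
        // ray-primitive pair during intersection testing.
        g_inlinedRays[i] = g_rays[g_sortedLinks[i].rayIndex];
    }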
For compacted storage, we pack the ray origins into cell-relative 21-bit triples.
The ray directions are packed into 16-bit tuples using the octahedron normal
vector encoding described in [Meyer et al. 10]. Together, these require a total of
three 32-bit integers per ray.
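The exact bit layout is not spelled out above, but one layout that fits these budgets packs the quantized origin into the first two words and the two 16-bit octahedral components into the third. A sketch, with an illustrative octahedral encoding:

    float2 OctEncode(float3 n)
    {
        // Project onto the octahedron, then fold the lower pyramid.
        n /= abs(n.x) + abs(n.y) + abs(n.z);
        float2 e = n.xy;
        if (n.z < 0.0)
            e = (1.0 - abs(e.yx)) * float2(e.x >= 0.0 ? 1.0 : -1.0,
                                           e.y >= 0.0 ? 1.0 : -1.0);
        return e * 0.5 + 0.5;             // map to [0, 1]
    }

    uint3 PackRay(float3 origin, float3 dir, float3 cellMin, float3 cellSize)
    {
        // Cell-relative origin, quantized to 21 bits per axis.
        uint3 q = (uint3)(saturate((origin - cellMin) / cellSize) * 2097151.0);
        // Direction as two 16-bit octahedral components.
        uint2 d = (uint2)(OctEncode(dir) * 65535.0);

        uint3 packed;
        packed.x = q.x | (q.y << 21);            // x (21) | y low (11)
        packed.y = (q.y >> 11) | (q.z << 10);    // y high (10) | z (21)
        packed.z = d.x | (d.y << 16);            // direction (16 + 16)
        return packed;
    }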
Persistent threads. During ray marching, we make use of the observations by
Aila et al. regarding GPU work distribution [Aila and Laine 09]: just like in their
BVH ray traversal algorithm, ray lengths in our ray marching stage may vary. On
current GPUs, the work distribution unit always waits for an entire unit of work
to be finished before distributing new units. For that reason, a single thread
working on a particularly long ray may block all other processing units in the
same processing group.
We therefore launch a fixed number of persistent GPU worker threads that
continually fetch new rays whenever they have finished marching along their
previous ray. This alleviates work distribution delays. Due to the SIMD nature
of GPUs, the problem cannot be completely avoided, but we can at least take
advantage of more fine-grained scheduling mechanisms.
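A persistent-threads kernel in the spirit of [Aila and Laine 09] might look as follows (illustrative HLSL; MarchRay stands in for the actual grid traversal):

    struct Ray { float3 origin; float3 direction; };

    StructuredBuffer<Ray>    g_rays        : register(t0);
    RWStructuredBuffer<uint> g_workCounter : register(u0); // one uint, zero-initialized

    cbuffer Constants : register(b0) { uint g_rayCount; };

    void MarchRay(Ray ray)
    {
        // ... march the ray through the grid until it terminates ...
    }

    [numthreads(64, 1, 1)]
    void PersistentMarch(uint3 dtid : SV_DispatchThreadID)
    {
        for (;;)
        {
            // Fetch the index of the next unprocessed ray.
            uint rayIndex;
            InterlockedAdd(g_workCounter[0], 1, rayIndex);
            if (rayIndex >= g_rayCount)
                break;                    // queue drained; retire this worker

            MarchRay(g_rays[rayIndex]);
        }
    }

In practice, the fetch is usually amortized by grabbing a batch of rays per wavefront rather than one per thread, which reduces pressure on the shared counter.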