During ray marching, enqueued rays are output in an arbitrary order, making
use of structured unordered access views enhanced with a global atomic counter
(D3D11_BUFFER_UAV_FLAG_COUNTER creation flag). A slot in the buffer is reserved
by calling RWStructuredBuffer.IncrementCounter(). The function yields an index
where the enqueued ray may be stored. Note that this is much more efficient than
using InterlockedAdd() on a counter value stored in a global memory buffer, as
IncrementCounter() is optimized for highly concurrent access.
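As a minimal HLSL sketch of this enqueueing step (the buffer and struct names here are illustrative, not taken from the actual implementation):

    struct Ray
    {
        float3 origin;
        float3 direction;
    };

    // UAV created with the D3D11_BUFFER_UAV_FLAG_COUNTER flag.
    RWStructuredBuffer<Ray> g_rayQueue : register(u0);

    uint EnqueueRay(Ray ray)
    {
        // Reserve a unique slot via the buffer's hidden hardware counter.
        uint slot = g_rayQueue.IncrementCounter();
        g_rayQueue[slot] = ray;
        return slot;
    }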
For each occupied grid cell crossed during ray marching, we output the ray
index annotated with the Morton code of the crossed cell. The Morton code is
constructed by interleaving the bits of the three integers forming the grid cell
index. Sorting data by Morton code yields the desired coherent Z-order memory
layout described in Section 2.4.2.
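The interleaving can be implemented with the classic bit-spreading sequence. The following HLSL sketch (function names are ours) assumes, for illustration, a grid of up to 1024^3 cells, so each coordinate fits into 10 bits:

    // Spread a 10-bit integer so its bits occupy every third position.
    uint Part1By2(uint x)
    {
        x &= 0x000003ff;                  // keep the low 10 bits
        x = (x ^ (x << 16)) & 0xff0000ff;
        x = (x ^ (x <<  8)) & 0x0300f00f;
        x = (x ^ (x <<  4)) & 0x030c30c3;
        x = (x ^ (x <<  2)) & 0x09249249;
        return x;
    }

    uint MortonCode3D(uint3 cell)
    {
        return (Part1By2(cell.z) << 2)
             | (Part1By2(cell.y) << 1)
             |  Part1By2(cell.x);
    }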
After sorting, each grid cell's range of ray links (ray indices with cell codes) is
extracted in another compute shader pass. This pass simply compares the Morton
codes of successive ray links. Whenever these differ, the end of the previous cell's
range and the beginning of the next cell's range have been found. Decoding the
Morton codes yields the grid indices of each cell, which allows the begin and
end ray link indices to be stored in the respective cells of a volume texture.
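A sketch of this extraction pass, again with illustrative HLSL names; DecodeMorton3D inverts the interleaving shown earlier:

    struct RayLink { uint rayIndex; uint mortonCode; };

    StructuredBuffer<RayLink> g_sortedLinks : register(t0);
    RWTexture3D<uint> g_cellBegin : register(u0);
    RWTexture3D<uint> g_cellEnd   : register(u1);

    cbuffer Constants : register(b0) { uint g_linkCount; };

    // Inverse of Part1By2: gather every third bit into the low 10 bits.
    uint Compact1By2(uint x)
    {
        x &= 0x09249249;
        x = (x ^ (x >>  2)) & 0x030c30c3;
        x = (x ^ (x >>  4)) & 0x0300f00f;
        x = (x ^ (x >>  8)) & 0xff0000ff;
        x = (x ^ (x >> 16)) & 0x000003ff;
        return x;
    }

    uint3 DecodeMorton3D(uint code)
    {
        return uint3(Compact1By2(code),
                     Compact1By2(code >> 1),
                     Compact1By2(code >> 2));
    }

    [numthreads(256, 1, 1)]
    void ExtractRanges(uint3 dtid : SV_DispatchThreadID)
    {
        uint i = dtid.x;
        if (i >= g_linkCount) return;

        uint code = g_sortedLinks[i].mortonCode;
        // A range begins where the code differs from its predecessor ...
        if (i == 0 || g_sortedLinks[i - 1].mortonCode != code)
            g_cellBegin[DecodeMorton3D(code)] = i;
        // ... and ends (exclusively) where it differs from its successor.
        if (i + 1 == g_linkCount || g_sortedLinks[i + 1].mortonCode != code)
            g_cellEnd[DecodeMorton3D(code)] = i + 1;
    }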
Compacted ray inlining. During intersection testing, all rays stored in a ray grid
cell need to be fetched and tested against each overlapping primitive. Following
the ray indices stored as ray links would cause repeated random indirect
memory accesses. To further increase memory access coherency, we therefore
compact and inline enqueued rays after sorting. Instead of following ray links
again and again for each primitive, we follow the ray links in every cell once up
front and store compacted clones of the referenced rays in an array parallel to
the ray link array.
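This compaction amounts to a single indirect fetch per ray link up front, for example (illustrative HLSL; the actual implementation stores the packed ray format described below):

    struct Ray     { float3 origin; float3 direction; };
    struct RayLink { uint rayIndex; uint mortonCode; };

    StructuredBuffer<RayLink> g_sortedLinks : register(t0);
    StructuredBuffer<Ray>     g_rays        : register(t1);
    // Stored parallel to g_sortedLinks.
    RWStructuredBuffer<Ray>   g_inlinedRays : register(u0);

    cbuffer Constants : register(b0) { uint g_linkCount; };

    [numthreads(256, 1, 1)]
    void InlineRays(uint3 dtid : SV_DispatchThreadID)
    {
        uint i = dtid.x;
        if (i >= g_linkCount) return;
        // One random indirect access here replaces one per
        // ray-primitive pair during intersection testing.
        g_inlinedRays[i] = g_rays[g_sortedLinks[i].rayIndex];
    }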
For compacted storage, we pack the ray origins into cell-relative 21-bit triples.
The ray directions are packed into 16-bit tuples using the octahedron normal
vector encoding described in [Meyer et al. 10]. Together, these require a total of
three 32-bit integers per ray.
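The exact bit layout is not spelled out above, but one layout that fits these budgets packs the quantized origin into the first two words and the two 16-bit octahedral components into the third. A sketch, with an illustrative octahedral encoding:

    float2 OctEncode(float3 n)
    {
        // Project onto the octahedron, then fold the lower pyramid.
        n /= abs(n.x) + abs(n.y) + abs(n.z);
        float2 e = n.xy;
        if (n.z < 0.0)
            e = (1.0 - abs(e.yx)) * float2(e.x >= 0.0 ? 1.0 : -1.0,
                                           e.y >= 0.0 ? 1.0 : -1.0);
        return e * 0.5 + 0.5;             // map to [0, 1]
    }

    uint3 PackRay(float3 origin, float3 dir, float3 cellMin, float3 cellSize)
    {
        // Cell-relative origin, quantized to 21 bits per axis.
        uint3 q = (uint3)(saturate((origin - cellMin) / cellSize) * 2097151.0);
        // Direction as two 16-bit octahedral components.
        uint2 d = (uint2)(OctEncode(dir) * 65535.0);

        uint3 packed;
        packed.x = q.x | (q.y << 21);            // x (21) | y low (11)
        packed.y = (q.y >> 11) | (q.z << 10);    // y high (10) | z (21)
        packed.z = d.x | (d.y << 16);            // direction (16 + 16)
        return packed;
    }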
Persistent threads. During ray marching, we make use of the observations by
Aila et al. regarding GPU work distribution [Aila and Laine 09]: just like in their
BVH ray traversal algorithm, ray lengths in our ray marching stage may vary. On
current GPUs, the work distribution unit always waits for an entire unit of work
to be finished before distributing new units. For that reason, a single thread
working on a particularly long ray may block all other processing units in the
same processing group.
We therefore launch a fixed number of persistent GPU worker threads that
continually fetch new rays whenever they have finished marching along their
previous ray. This alleviates work distribution delays. Due to the SIMD nature
of GPUs, the problem cannot be completely avoided, but we can at least take
advantage of more fine-grained scheduling mechanisms.
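A persistent-threads kernel in the spirit of [Aila and Laine 09] might look as follows (illustrative HLSL; MarchRay stands in for the actual grid traversal):

    struct Ray { float3 origin; float3 direction; };

    StructuredBuffer<Ray>    g_rays        : register(t0);
    RWStructuredBuffer<uint> g_workCounter : register(u0); // one uint, zero-initialized

    cbuffer Constants : register(b0) { uint g_rayCount; };

    void MarchRay(Ray ray)
    {
        // ... march the ray through the grid until it terminates ...
    }

    [numthreads(64, 1, 1)]
    void PersistentMarch(uint3 dtid : SV_DispatchThreadID)
    {
        for (;;)
        {
            // Fetch the index of the next unprocessed ray.
            uint rayIndex;
            InterlockedAdd(g_workCounter[0], 1, rayIndex);
            if (rayIndex >= g_rayCount)
                break;                    // queue drained; retire this worker

            MarchRay(g_rays[rayIndex]);
        }
    }

In practice, the fetch is usually amortized by grabbing a batch of rays per wavefront rather than one per thread, which reduces pressure on the shared counter.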