Object-Order Ray Tracing for Fully Dynamic Scenes - GPU Pro: Advanced Rendering Techniques

the update fails, as negative values are always smaller than the given positive distance. In this case, updating is temporarily skipped (line 18) and will be retried in a subsequent loop iteration (while checks the lock state in line 37). Note that simply looping until all other concurrent updates have finished and the record is unlocked would not work. The SIMT execution model of current GPUs implies that threads skipping certain instructions have to wait for the threads that execute these instructions until all threads are back in sync and can continue to operate in lockstep. Therefore, it is important to implement the spin lock by skipping instructions in the waiting threads. Otherwise, waiting threads would actively hold back the unlocked threads from performing the work they are waiting on.

In case of a successful atomic distance update, the entire record is locked using InterlockedCompareExchange() (line 22). The distance of the hit point is passed for comparison. If no closer point of intersection has been found in the meantime, the negated distance is written to the record, acquiring the lock on the hit data. If the exchange fails, the hit point is discarded: a closer point has already been found by some other thread. If the exchange succeeds, all the other hit information is updated. Afterwards, the record is unlocked using InterlockedExchange() to reset the distance attribute to the positive distance value (line 32).
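The listing that the line numbers above refer to is not reproduced in this excerpt. As a rough illustration of the protocol only, the following is a CPU-side C++ sketch using std::atomic. All names (HitRecord, Attempt, floatBits, atomicMin, tryRecordHit, recordHit) are hypothetical, the payload is reduced to a single triangle index, and the atomic minimum is emulated with a compare-exchange loop as a stand-in for the GPU atomic. It assumes that positive float distances are compared through their integer bit patterns, which preserves their order, and that a negated distance (sign bit set) marks a locked record.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Reinterpret a float as a signed 32-bit integer. For positive floats the
// integer order matches the float order; negated floats become negative ints.
inline std::int32_t floatBits(float f) {
    std::int32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    return bits;
}

// Hypothetical hit record: distance bits double as the lock word.
struct HitRecord {
    std::atomic<std::int32_t> distanceBits{
        floatBits(std::numeric_limits<float>::max())};
    int triangle = -1;  // stand-in for "all the other hit information"
};

// Emulated atomic minimum; returns the value observed before the update.
inline std::int32_t atomicMin(std::atomic<std::int32_t>& target,
                              std::int32_t value) {
    std::int32_t current = target.load();
    while (value < current && !target.compare_exchange_weak(current, value)) {
    }
    return current;
}

enum class Attempt { Locked, Rejected, Written };

// One attempt to store a hit. A locked record is not waited on here; the
// attempt is skipped and the caller is expected to retry later.
inline Attempt tryRecordHit(HitRecord& rec, float dist, int triangle) {
    assert(dist > 0.0f);
    const std::int32_t unlocked = floatBits(dist);
    const std::int32_t locked = floatBits(-dist);

    const std::int32_t previous = atomicMin(rec.distanceBits, unlocked);
    if (previous < 0) return Attempt::Locked;           // min failed: record locked
    if (previous <= unlocked) return Attempt::Rejected; // existing hit is closer

    // Lock the record only if our distance is still the stored one.
    std::int32_t expected = unlocked;
    if (!rec.distanceBits.compare_exchange_strong(expected, locked))
        return Attempt::Rejected;  // a closer hit arrived in the meantime

    rec.triangle = triangle;              // update the remaining hit data
    rec.distanceBits.exchange(unlocked);  // unlock: restore positive distance
    return Attempt::Written;
}

// Retry loop: each iteration either completes or skips the update.
inline void recordHit(HitRecord& rec, float dist, int triangle) {
    while (tryRecordHit(rec, dist, triangle) == Attempt::Locked) {
    }
}
```

On a CPU the retry loop in recordHit() is harmless. In a shader, the equivalent loop would have to contain the whole attempt in its body, so that waiting threads skip the update rather than spin in a separate inner wait, for the lockstep reasons given above.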

Input register pressure and incoherence. Our first implementation simply passed entire triangles from the geometry shader to the pixel shader using many nointerpolation registers. This proved to be problematic in two ways. Firstly, once we exceeded a certain number of pixel shader input registers, performance deteriorated greatly: high register pressure limits the number of concurrent pixel shader threads that can be started. Secondly, the number of rays enqueued per voxel varies. It turned out to be too incoherent for SIMD parallelism to work on a per-voxel level: some threads were looping through large numbers of rays while others were mostly idling.
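To get a feeling for the cost of this incoherence, the following C++ sketch estimates how much of a SIMD group's work is useful when every lane loops over the rays of one voxel. It is a simplified model and not taken from the chapter: it assumes that a group runs for as many iterations as its longest lane needs, and the function name simdUtilization and the group width parameter are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Fraction of lane iterations that do useful work when each lane loops over
// the rays of one voxel and a SIMD group of `width` lanes runs in lockstep.
// Assumes a group stays busy for as long as its longest lane.
inline double simdUtilization(const std::vector<int>& raysPerVoxel,
                              std::size_t width) {
    assert(width > 0);
    long long useful = 0;  // iterations that actually process a ray
    long long issued = 0;  // iterations paid for by the whole group
    for (std::size_t base = 0; base < raysPerVoxel.size(); base += width) {
        const std::size_t end = std::min(base + width, raysPerVoxel.size());
        int longest = 0;
        for (std::size_t i = base; i < end; ++i) {
            useful += raysPerVoxel[i];
            longest = std::max(longest, raysPerVoxel[i]);
        }
        issued += static_cast<long long>(longest) *
                  static_cast<long long>(width);
    }
    return issued == 0 ? 1.0 : static_cast<double>(useful) / issued;
}
```

Under this model, four voxels with 64, 1, 1, and 1 enqueued rays in a group of four lanes yield 67 useful out of 256 issued iterations, so roughly a quarter of the lanes' time is spent on actual work.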

Load balancing using geometry shaders. To keep threads from idling, we implemented a two-pass load balancing scheme that makes use of geometry shaders to achieve full parallelism on a per-ray and per-triangle level. The scheme is illustrated in Figure 2.9.

In the first pass, the current batch of triangles is voxelized. During this pass, both the transformed triangles and pairs of ray grid cell and triangle indices for all touched voxels are streamed out into auxiliary buffers. In the second pass, all cell-triangle pairs are read in parallel using null-input vertex and geometry shaders. The primitive ID (retrieved using the SV_PrimitiveID shader system value input semantic) indicates which pair to process.
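The data flow of the two passes can be sketched on the CPU as follows. This is a minimal C++ illustration under stated assumptions, not the shader code: voxelization is replaced by a precomputed list of touched cells per triangle, and the names CellTrianglePair, streamOutPairs, and raysForPrimitive are hypothetical. How a pair is then expanded into per-ray work is not described in this excerpt, so the second function only looks up how many rays the pair would have to be tested against.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One streamed-out work item: a ray grid cell touched by a triangle.
struct CellTrianglePair {
    std::uint32_t cell;
    std::uint32_t triangle;
};

// Pass 1 stand-in: touchedCells[t] lists the grid cells that triangle t
// touches (the result of voxelization). All pairs go into one flat buffer.
inline std::vector<CellTrianglePair> streamOutPairs(
    const std::vector<std::vector<std::uint32_t>>& touchedCells) {
    std::vector<CellTrianglePair> pairs;
    for (std::uint32_t tri = 0; tri < touchedCells.size(); ++tri)
        for (std::uint32_t cell : touchedCells[tri])
            pairs.push_back({cell, tri});
    return pairs;
}

// Pass 2 stand-in: one invocation per pair, selected by the primitive ID.
// Returns the number of rays enqueued in the pair's cell.
inline std::uint32_t raysForPrimitive(
    const std::vector<CellTrianglePair>& pairs,
    const std::vector<std::uint32_t>& rayCountPerCell,
    std::uint32_t primitiveId) {
    assert(primitiveId < pairs.size());
    return rayCountPerCell[pairs[primitiveId].cell];
}
```

Because every pair is an independent entry in a flat buffer, the number of invocations in the second pass equals the number of pairs, regardless of how unevenly triangles and rays are distributed over the grid.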

DirectX 11 provides DrawInstancedIndirect() to issue draw calls where the number of vertices and instances resides in a GPU buffer. This allows us to trigger
