the update fails, as negative values are always smaller than the given positive
distance. In this case, updating is temporarily skipped (line 18) and will be
retried in a subsequent loop iteration (while checks the lock state in line 37).
Note that simply looping until the record becomes unlocked, that is, until all
other concurrent updates have finished, would not work. The SIMT execution
model of current GPUs implies that threads skipping certain instructions
have to wait for the threads that execute these instructions until all threads
are back in sync and can continue to operate in lockstep. Therefore, it is
important to implement the spin lock by skipping instructions in the waiting
threads. Otherwise, waiting threads would actively hold back the unlocked
threads from performing the very work the waiting threads are waiting on.
In case of a successful atomic distance update, the entire record is locked using
InterlockedCompareExchange() (line 22). The distance of the hit point is passed
for comparison. If no closer point of intersection has been found in the meantime,
the negated distance will be written to the record, acquiring the lock on the hit
data. If the exchange fails, the hit point is discarded. In this case, a closer
point has already been found by some other thread. If the exchange succeeds, all
the other hit information is updated. Afterwards, the record is unlocked using
InterlockedExchange() to reset the distance attribute to the positive distance
value (line 32).
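
The locking protocol described above can be emulated on the CPU. The following is a minimal sketch, not the shader code the line numbers refer to. It assumes strictly positive integer distances, and the names HitRecord, atomic_min, and record_hit are hypothetical. atomic_min stands in for an atomic minimum that returns the previous value, compare_exchange_strong for InterlockedCompareExchange(), and the final store for InterlockedExchange(). CPU threads do not run in lockstep, so the retry loop here only mirrors the control flow and does not reproduce the SIMT issue.

```cpp
#include <atomic>
#include <cassert>
#include <climits>
#include <thread>
#include <vector>

// Hypothetical hit record: `dist` doubles as the lock (negative = locked).
struct HitRecord {
    std::atomic<int> dist{INT_MAX};
    int triangle = -1;  // payload that may only be written while locked
};

// Atomic minimum: stores min(old, value) and returns the old value.
inline int atomic_min(std::atomic<int>& a, int value) {
    int old = a.load();
    while (old > value && !a.compare_exchange_weak(old, value)) {}
    return old;
}

// Tries to record a hit; returns true if this hit was written to the record.
inline bool record_hit(HitRecord& rec, int dist, int triangle) {
    assert(dist > 0);  // assumption: zero or negative distances never occur
    for (;;) {
        int prev = atomic_min(rec.dist, dist);
        if (prev < 0) {                  // locked: the minimum cannot succeed
            std::this_thread::yield();   // skip the update, retry next iteration
            continue;
        }
        if (prev <= dist) return false;  // stored hit is at least as close
        int expected = dist;
        if (!rec.dist.compare_exchange_strong(expected, -dist))
            return false;                // a closer hit arrived in the meantime
        rec.triangle = triangle;         // record is locked: update the hit data
        rec.dist.store(dist);            // unlock by restoring the positive distance
        return true;
    }
}
```

Calling record_hit from several threads with different distances should leave the smallest distance and its matching payload in the record, regardless of the order in which the threads run.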
Input register pressure and incoherence. Our first implementation simply passed
entire triangles from geometry to pixel shader using many nointerpolation
registers. This proved to be problematic in two ways. Firstly, once we exceeded
a certain number of pixel shader input registers, performance deteriorated
greatly. High register pressure limits the number of concurrent pixel shader
threads that can be started. Secondly, the number of rays enqueued per voxel
varies. The workload turned out to be too incoherent for SIMD parallelism to
work on a per-voxel level: some threads were looping through large numbers of
rays while others were mostly idling.
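
To make the cost of this incoherence concrete, the sketch below estimates how much of a SIMD group's time is spent on useful work when each lane loops over a different number of rays. It rests on a simplifying assumption that is not taken from the text: the whole group stays busy for as long as its longest lane, and every loop iteration costs the same. The function name simd_utilization is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Fraction of lane-iterations that do useful work when all lanes of a group
// have to wait for the lane with the longest loop.
inline double simd_utilization(const std::vector<int>& rays_per_lane) {
    if (rays_per_lane.empty()) return 1.0;
    assert(*std::min_element(rays_per_lane.begin(), rays_per_lane.end()) >= 0);
    int longest = *std::max_element(rays_per_lane.begin(), rays_per_lane.end());
    if (longest == 0) return 1.0;  // nothing to do, nothing wasted
    long total = std::accumulate(rays_per_lane.begin(), rays_per_lane.end(), 0L);
    return static_cast<double>(total) /
           (static_cast<double>(longest) * static_cast<double>(rays_per_lane.size()));
}
```

Under this model, a group of four lanes with 64, 1, 1, and 1 rays does 67 useful iterations out of 256 available ones, while four lanes with equal ray counts are fully utilized.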
Load balancing using geometry shaders. To keep threads from idling, we
implemented a two-pass load balancing scheme that makes use of geometry shaders
to achieve full parallelism on a per-ray and per-triangle level. The scheme is
illustrated in Figure 2.9.
In the first pass, the current batch of triangles is voxelized. During this pass,
both the transformed triangles and pairs of ray grid cell and triangle indices for
all touched voxels are streamed out into auxiliary buffers. In the second pass,
all cell-triangle pairs are read in parallel using null-input vertex and geometry
shaders. The primitive ID (retrieved using the SV_PrimitiveID shader system
value input semantic) indicates which pair to process.
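
The data flow of the two passes can be sketched on the CPU as follows. This is only an illustration under stated assumptions: the per-triangle cell lists stand in for the result of voxelization, a plain vector stands in for the stream-out buffer, and the loop index plays the role of the primitive ID. The names CellTrianglePair, emit_pairs, and count_tests are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One streamed-out work item: a ray grid cell and a triangle touching it.
struct CellTrianglePair {
    std::uint32_t cell;
    std::uint32_t triangle;
};

// Pass 1 stand-in: cells_per_triangle[t] lists the cells touched by triangle t.
// Every (cell, triangle) combination becomes one entry in the pair buffer.
inline std::vector<CellTrianglePair> emit_pairs(
        const std::vector<std::vector<std::uint32_t>>& cells_per_triangle) {
    std::vector<CellTrianglePair> pairs;
    for (std::uint32_t t = 0; t < cells_per_triangle.size(); ++t)
        for (std::uint32_t cell : cells_per_triangle[t])
            pairs.push_back({cell, t});
    return pairs;
}

// Pass 2 stand-in: each pair is an independent work item selected by its
// index. Here we only count the ray-triangle tests each item would generate.
inline std::uint64_t count_tests(const std::vector<CellTrianglePair>& pairs,
                                 const std::vector<std::uint32_t>& rays_in_cell) {
    std::uint64_t tests = 0;
    for (std::size_t primitive_id = 0; primitive_id < pairs.size(); ++primitive_id) {
        const CellTrianglePair& p = pairs[primitive_id];
        assert(p.cell < rays_in_cell.size());
        tests += rays_in_cell[p.cell];  // one test per ray enqueued in the cell
    }
    return tests;
}
```

Because every pair is self-contained, the work items of the second pass can be distributed evenly, independent of how many triangles or rays a particular voxel holds.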
DirectX 11 provides DrawInstancedIndirect() to issue draw calls where the
number of vertices and instances resides in a GPU buffer. This allows us to trigger