Like the memory barrier functions, these atomic functions can be used on group
shared memory as well as on resource memory, which allows for a wide range of potential
uses. Each function performs an operation that can turn the contents of either a group
shared memory location or a device resource location into a synchronization primitive.
For example, if a compute shader program wants to count the number of threads that
encounter a particular data value, the total count can be initialized to zero, and each
thread can perform an InterlockedAdd() on either a GSM location (for the fastest access
speed) or a resource location (which persists between dispatch calls). The atomic
functions guarantee that the total count is incremented correctly, with no thread
overwriting another thread's intermediate result.
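The counting pattern described above can be sketched as follows. This is a minimal, hypothetical example: the buffer names (g_InputBuffer, g_CountBuffer), the target value, and the thread group size are assumptions for illustration, not taken from the text.

```hlsl
// Sketch: count threads that encounter a particular data value, tallying
// first in the GSM and then folding the group total into a resource.

RWStructuredBuffer<uint> g_InputBuffer : register(u0); // input data (assumed)
RWByteAddressBuffer      g_CountBuffer : register(u1); // running total (assumed)

groupshared uint g_GroupCount; // per-group tally in the GSM (fastest access)

#define TARGET_VALUE 42 // illustrative value to search for

[numthreads(64, 1, 1)]
void CSMain( uint3 id : SV_DispatchThreadID, uint gi : SV_GroupIndex )
{
    // One thread zeroes the group counter before anyone increments it.
    if ( gi == 0 )
        g_GroupCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // Each thread atomically bumps the group counter if it finds the value.
    if ( g_InputBuffer[id.x] == TARGET_VALUE )
        InterlockedAdd( g_GroupCount, 1 );
    GroupMemoryBarrierWithGroupSync();

    // One thread adds the group total to the resource, which persists
    // between dispatch calls.
    if ( gi == 0 )
        g_CountBuffer.InterlockedAdd( 0, g_GroupCount );
}
```

Tallying in the GSM first and performing only one resource atomic per group keeps contention on the device resource low.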
Since each of these functions provides a different type of operation, developers have
significant freedom in implementing a desired form of synchronization. For example,
the InterlockedCompareExchange() function compares the value at a destination to a
reference value and, if the two match, writes a third argument to the destination. This
functionality allows the implementation of data that can be "checked out" by one thread
and later checked back in for use by another. Because these functions are very low level,
they can be applied to a particular situation quite flexibly. Each function has its own
input requirements and operation, so the Direct3D 11 documentation should be consulted
when selecting an appropriate function. These functions are also available in the pixel
shader stage, allowing it to synchronize across resources as well (since the pixel shader
has no group shared memory, it can only use device resources for thread-to-thread
communication).
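The "check-out" pattern can be sketched with InterlockedCompareExchange() as shown below. This is a hypothetical example: the one-slot lock buffer (g_Lock) and its free/held encoding are assumptions for illustration.

```hlsl
// Sketch: a single-slot "check-out" flag protected by atomic compare/exchange.

RWStructuredBuffer<uint> g_Lock : register(u0); // 0 = free, 1 = checked out (assumed)

[numthreads(64, 1, 1)]
void CSMain( uint3 id : SV_DispatchThreadID )
{
    uint original;

    // Atomically: if g_Lock[0] holds 0 (free), write 1 (checked out).
    // 'original' receives whatever value was there before the attempt.
    InterlockedCompareExchange( g_Lock[0], 0, 1, original );

    if ( original == 0 )
    {
        // This thread won the exchange and has "checked out" the data.
        // ... operate on the protected data here ...

        // Check the data back in so another thread can acquire it later.
        uint unused;
        InterlockedExchange( g_Lock[0], 0, unused );
    }
    // Threads that lost the exchange simply skip the protected work;
    // spin-waiting on a GPU is generally unsafe and is avoided here.
}
```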
5.4.3 Implicit Synchronization
The final form of synchronization that we will discuss is actually not an explicit part of
the compute shader functionality. We will refer to this as implicit synchronization, which
occurs when an algorithm is designed to access its data set in such a way that there are no
potential interactions between threads. This is the preferred method of synchronization—
when no synchronization is needed! Each of the other two methods mentioned above is
effective and useful, but both come at some cost to performance. If an algorithm can access
a memory resource in an explicit and orchestrated manner, then no additional synchroniza-
tion functions are needed, and no extra thread context switching is incurred either.
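The doubling example from earlier in the chapter illustrates this well, and can be sketched as a complete shader. The buffer name and thread group size are assumed for illustration.

```hlsl
// Sketch: each thread reads, doubles, and writes back exactly one element.
// Since no two threads ever touch the same location, the access pattern
// itself provides the synchronization; no barriers or atomics are needed.

RWStructuredBuffer<float> g_Data : register(u0); // buffer name assumed

[numthreads(64, 1, 1)]
void CSMain( uint3 id : SV_DispatchThreadID )
{
    // One-to-one mapping of threads to elements: completely independent work.
    g_Data[id.x] = 2.0f * g_Data[id.x];
}
```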
For example, in our simplified example program from earlier in the chapter, we
read a value from a resource, doubled it, and stored it back to the resource. In this case,
no thread-to-thread communication was needed, and hence no synchronization: each
thread can operate completely independently of the others. Another example of this type
of algorithm is the creation of a particle system, where the particle state is stored in a
structured buffer and accessed with an append/