Like the memory barrier functions, these atomic functions can be used on group
shared memory as well as on resource memory, which allows for a wide range of potential
uses. Each function performs an operation that can turn the contents of either a group
shared memory location or a device resource location into a synchronization primitive.
For example, if a compute shader program wants to count the number of threads that
encounter a particular data value, the total count can be initialized to zero, and each
thread can perform an InterlockedAdd() on either a GSM location (for the fastest access
speed) or a resource location (which persists between dispatch calls). The atomic
functions guarantee that the total count is incremented correctly, with no thread
overwriting another thread's intermediate result.
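The counting pattern described above can be sketched as follows. This is a minimal, hypothetical example: the buffer names (g_InputBuffer, g_CountBuffer), the target value, and the thread group size are assumptions for illustration, not taken from the text.

```hlsl
// Sketch: count threads that encounter a particular data value, tallying
// first in the GSM and then folding the group total into a resource.

RWStructuredBuffer<uint> g_InputBuffer : register(u0); // input data (assumed)
RWByteAddressBuffer      g_CountBuffer : register(u1); // running total (assumed)

groupshared uint g_GroupCount; // per-group tally in the GSM (fastest access)

#define TARGET_VALUE 42 // illustrative value to search for

[numthreads(64, 1, 1)]
void CSMain( uint3 id : SV_DispatchThreadID, uint gi : SV_GroupIndex )
{
    // One thread zeroes the group counter before anyone increments it.
    if ( gi == 0 )
        g_GroupCount = 0;
    GroupMemoryBarrierWithGroupSync();

    // Each thread atomically bumps the group counter if it finds the value.
    if ( g_InputBuffer[id.x] == TARGET_VALUE )
        InterlockedAdd( g_GroupCount, 1 );
    GroupMemoryBarrierWithGroupSync();

    // One thread adds the group total to the resource, which persists
    // between dispatch calls.
    if ( gi == 0 )
        g_CountBuffer.InterlockedAdd( 0, g_GroupCount );
}
```

Tallying in the GSM first and performing only one resource atomic per group keeps contention on the device resource low.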
Since each of these functions provides a different type of operation, developers have
significant freedom in implementing a desired form of synchronization. For example,
the InterlockedCompareExchange() function compares the value at a destination to a
reference value and, if the two match, writes a third argument to the destination. This
functionality allows the implementation of data that can be "checked out" by one thread
and later checked back in for use by another. Because these functions are very low level,
they can be applied to a particular situation quite flexibly. Each function has its own
input requirements and operation, so the Direct3D 11 documentation should be consulted
when selecting an appropriate function. These functions are also available in the pixel
shader stage, allowing it to synchronize across resources as well (since the pixel shader
has no group shared memory, it can only use device resources for thread-to-thread
communication).
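The "check-out" pattern can be sketched with InterlockedCompareExchange() as shown below. This is a hypothetical example: the one-slot lock buffer (g_Lock) and its free/held encoding are assumptions for illustration.

```hlsl
// Sketch: a single-slot "check-out" flag protected by atomic compare/exchange.

RWStructuredBuffer<uint> g_Lock : register(u0); // 0 = free, 1 = checked out (assumed)

[numthreads(64, 1, 1)]
void CSMain( uint3 id : SV_DispatchThreadID )
{
    uint original;

    // Atomically: if g_Lock[0] holds 0 (free), write 1 (checked out).
    // 'original' receives whatever value was there before the attempt.
    InterlockedCompareExchange( g_Lock[0], 0, 1, original );

    if ( original == 0 )
    {
        // This thread won the exchange and has "checked out" the data.
        // ... operate on the protected data here ...

        // Check the data back in so another thread can acquire it later.
        uint unused;
        InterlockedExchange( g_Lock[0], 0, unused );
    }
    // Threads that lost the exchange simply skip the protected work;
    // spin-waiting on a GPU is generally unsafe and is avoided here.
}
```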
5.4.3 Implicit Synchronization
The final form of synchronization that we will discuss is actually not an explicit part of
the compute shader functionality. We will refer to this as implicit synchronization, which
occurs when an algorithm is designed to access its data set in such a way that there are no
potential interactions between threads. This is the preferred method of synchronization—
when no synchronization is needed! Each of the other two methods mentioned above is
effective and useful, but both come at some cost to performance. If an algorithm can access
a memory resource in an explicit and orchestrated manner, then no additional synchroniza-
tion functions are needed, and no extra thread context switching is incurred either.
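The doubling example from earlier in the chapter illustrates this well, and can be sketched as a complete shader. The buffer name and thread group size are assumed for illustration.

```hlsl
// Sketch: each thread reads, doubles, and writes back exactly one element.
// Since no two threads ever touch the same location, the access pattern
// itself provides the synchronization; no barriers or atomics are needed.

RWStructuredBuffer<float> g_Data : register(u0); // buffer name assumed

[numthreads(64, 1, 1)]
void CSMain( uint3 id : SV_DispatchThreadID )
{
    // One-to-one mapping of threads to elements: completely independent work.
    g_Data[id.x] = 2.0f * g_Data[id.x];
}
```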
For example, in our simplified example program from earlier in the chapter, we
read a value from a resource, doubled it, and stored it back to the resource. In this case,
no thread-to-thread communication was needed, and hence no synchronization: each
thread can operate completely independently of the others. Another example of this type
of algorithm is the creation of a particle system, where the particle state is stored in a
structured buffer and accessed with an append/