The Computation Pipeline - Practical Rendering and Computation with Direct3D 11

CPU can be found in Chapter 7, "Multithreaded Rendering"). How can such a massive number of threads be efficiently synchronized without losing the performance that the GPU's parallelism provides? Fortunately, several different mechanisms are available for synchronizing the threads of a thread group. We will explore each of these possibilities in the following sections.

5.4.1 Memory Barriers

We will first look at the highest-level synchronization techniques, referred to as memory barriers. HLSL provides a number of intrinsic functions that can be used to synchronize memory accesses across all threads in a thread group. It is important to note that these functions synchronize only the threads within a single thread group, and not the threads across an entire dispatch. Two properties differentiate the functions from one another. The first is the class of memory that the threads synchronize across when the function is called: group shared memory, device memory, or both. The second is whether all of the threads in a given thread group are also synchronized to the same point within their execution. Together, these two properties provide a range of synchronization behaviors for the developer to choose from. The different versions of these intrinsic functions are listed in Table 5.1 below.

Without Group Synchronization    With Group Synchronization
GroupMemoryBarrier()             GroupMemoryBarrierWithGroupSync()
DeviceMemoryBarrier()            DeviceMemoryBarrierWithGroupSync()
AllMemoryBarrier()               AllMemoryBarrierWithGroupSync()

Table 5.1. Intrinsic Functions: without and with group synchronization.
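
As a rough illustration of how one of the intrinsics from Table 5.1 might be used, the following sketch has each thread write one value into group shared memory and then read a value written by a neighboring thread in the same group. The buffer names (InputBuffer, OutputBuffer), the register slots, the entry point name CSMain, and the group size of 64 are assumptions made for this example and do not come from the text. The sketch also assumes that both buffers hold at least one element per dispatched thread.

```hlsl
// Minimal sketch, not a definitive implementation.
// GROUP_SIZE, the buffer names, and the register slots are illustrative assumptions.
#define GROUP_SIZE 64

StructuredBuffer<float>   InputBuffer  : register(t0);
RWStructuredBuffer<float> OutputBuffer : register(u0);

// Group shared memory: visible to every thread in one thread group,
// but not to threads in other groups of the same dispatch.
groupshared float SharedData[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void CSMain(uint3 dispatchThreadID : SV_DispatchThreadID,
            uint  groupIndex       : SV_GroupIndex)
{
    // Step 1: each thread writes its own element into group shared memory.
    SharedData[groupIndex] = InputBuffer[dispatchThreadID.x];

    // Step 2: block until all group shared memory accesses have completed
    // and every thread in the group has reached this call.
    GroupMemoryBarrierWithGroupSync();

    // Step 3: read an element that was written by a different thread.
    uint neighbor = (groupIndex + 1) % GROUP_SIZE;
    OutputBuffer[dispatchThreadID.x] = SharedData[groupIndex] + SharedData[neighbor];
}
```

The group-synchronizing variant is used here because each thread depends on a value that another thread must already have written, so the threads also need to reach the same point in their execution before any of them reads. Which variant suits a given shader depends on the two properties described above.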

Each of these functions will block a thread from continuing until that function's particular conditions have been met. The first function, GroupMemoryBarrier(), blocks a thread's execution until all writes to the group shared memory from all threads in a thread group have been completed. This ensures that when threads share data with one another through the group shared memory, the desired values have had a chance to be written before they are read by other threads. There is an important distinction here between the shader core executing a write instruction and that instruction actually being carried out by the GPU's memory system, at which point the written value becomes available to other threads. Depending on the hardware implementation, there can be a variable amount of time between writing a value and when it actually ends up at its destination. By performing a blocking operation until these writes
