an implementation may or may not execute the threading commands of the developer precisely as they are declared. For example, when a thread group is declared, it can have up to 1024 threads. From a programmatic point of view, all of these threads execute simultaneously. However, from a hardware perspective they may not all execute in parallel. Specifically, if a particular GPU doesn't have 1024 processing cores, then it is impossible for a complete thread group to be executed simultaneously.
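As a point of reference, the following sketch (the kernel name is illustrative) declares a thread group at this 1024-thread limit; the numthreads attribute fixes the group dimensions at compile time, regardless of how many processing cores the target GPU actually provides.

[numthreads( 32, 32, 1 )]    // 32 * 32 * 1 = 1024 threads per group
void CSMAIN( uint3 DispatchThreadID : SV_DispatchThreadID )
{
    // Every one of the 1024 threads in the group runs this kernel body, although
    // the hardware is free to schedule them in smaller batches rather than all at once.
}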
Instead, the threads are executed in a manner that ensures that they behave as if they were operating at the same time. For example, whenever a point in the shader program requires a synchronization of all of the threads (synchronization is covered in more detail later in this chapter), each subgroup of threads is executed up to the synchronization point, and then swapped out so that another subgroup can be executed to the same point. Only after all of the threads in a thread group have completed up to this synchronization point can they continue on. This method of operation can have some performance implications if there are excessive synchronization points in the compute shader, but how much of an impact this has will depend on the GPU hardware that the shader is executing on. As the number of processing cores continues to increase, this will become less and less of a performance issue.
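To make the idea of a synchronization point concrete, here is a minimal sketch (the kernel and variable names are illustrative, not taken from the text) in which every thread in a group writes a value into group shared memory and then waits at a group synchronization intrinsic before reading what its neighbors wrote:

// Group shared memory used as a staging area by the whole thread group.
groupshared float GroupData[1024];

[numthreads( 1024, 1, 1 )]
void CSMAIN( uint3 GroupThreadID : SV_GroupThreadID )
{
    // Each thread writes one value into group shared memory.
    GroupData[GroupThreadID.x] = (float)GroupThreadID.x;

    // Synchronization point: every thread in the group must reach this call
    // before any thread is allowed to continue past it, so the hardware may
    // execute subgroups up to here and swap them in and out as needed.
    GroupMemoryBarrierWithGroupSync();

    // After the barrier, each thread can safely read a value written by another.
    float neighbor = GroupData[(GroupThreadID.x + 1) % 1024];
}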
5.3 DirectCompute Memory Model
The overall compute shader execution model provides a great deal of flexibility for in-
stantiating a suitable number of threads to execute the desired processing kernel on the
elements of a resource. It is easy to map a complete resource to a given number of threads
and perform some computation on each of its data elements. With this execution model in
mind, we will now turn our attention to what can be done within the compute shader itself.
We will investigate some of the unique features of the compute shader memory model that
give developers even more flexibility in deciding how to implement an algorithm.
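As a simple illustration of the resource-to-thread mapping described above (the resource names and dimensions here are assumptions made for the sketch), a kernel can use its dispatch thread ID to address one element of a texture per thread:

Texture2D<float4>   InputMap  : register( t0 );
RWTexture2D<float4> OutputMap : register( u0 );

// One thread per texel; for a 640 x 480 texture the application would call
// Dispatch( 20, 15, 1 ) to launch enough 32 x 32 groups to cover the resource.
[numthreads( 32, 32, 1 )]
void CSMAIN( uint3 DispatchThreadID : SV_DispatchThreadID )
{
    OutputMap[DispatchThreadID.xy] = InputMap[DispatchThreadID.xy];
}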
5.3.1 Register-Based Memory
The compute shader runs on the same programmable processing hardware as the other programmable shader stages. This means that it is based on the same general processing paradigm and also implements the common shader core. It uses a register-based processing concept similar to that of the other pipeline stages, with the exception that the computation pipeline consists of only a single stage. The set of registers that the compute shader supports is quite similar to that of the other programmable stages, and includes input attribute registers (v#), texture registers (t#), constant buffer registers (cb#), unordered registers (u#), and temporary registers (r#, x#). Since all shader programming is performed