an implementation may or may not execute the threading commands of the developer precisely as they are declared. For example, when a thread group is declared, it can have up to 1024 threads. From a programmatic point of view, all of these threads execute simultaneously. However, from a hardware perspective they may not all execute in parallel. Specifically, if a particular GPU doesn't have 1024 processing cores, then it is impossible for a complete thread group to be executed simultaneously.
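As a point of reference, the following sketch (the kernel name is illustrative) declares a thread group at this 1024-thread limit; the numthreads attribute fixes the group dimensions at compile time, regardless of how many processing cores the target GPU actually provides.

[numthreads( 32, 32, 1 )]    // 32 * 32 * 1 = 1024 threads per group
void CSMAIN( uint3 DispatchThreadID : SV_DispatchThreadID )
{
    // Every one of the 1024 threads in the group runs this kernel body, although
    // the hardware is free to schedule them in smaller batches rather than all at once.
}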
Instead, the threads are executed in a manner that ensures that they behave as if they were operating at the same time. For example, whenever a point in the shader program requires a synchronization of all of the threads (synchronization is covered in more detail later in this chapter), each subgroup of threads is executed up to the synchronization point, and then swapped out so that another subgroup can be executed to the same point. Only after all of the threads in a thread group have completed up to this synchronization point can they continue on. This method of operation can have some performance implications if there are excessive synchronization points in the compute shader, but how much of an impact this has will depend on the GPU hardware that the shader is executing on. As the number of processing cores continues to increase, this will become less and less of a performance issue.
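To make the idea of a synchronization point concrete, here is a minimal sketch (the kernel and variable names are illustrative, not taken from the text) in which every thread in a group writes a value into group shared memory and then waits at a group synchronization intrinsic before reading what its neighbors wrote:

// Group shared memory used as a staging area by the whole thread group.
groupshared float GroupData[1024];

[numthreads( 1024, 1, 1 )]
void CSMAIN( uint3 GroupThreadID : SV_GroupThreadID )
{
    // Each thread writes one value into group shared memory.
    GroupData[GroupThreadID.x] = (float)GroupThreadID.x;

    // Synchronization point: every thread in the group must reach this call
    // before any thread is allowed to continue past it, so the hardware may
    // execute subgroups up to here and swap them in and out as needed.
    GroupMemoryBarrierWithGroupSync();

    // After the barrier, each thread can safely read a value written by another.
    float neighbor = GroupData[(GroupThreadID.x + 1) % 1024];
}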
5.3 DirectCompute Memory Model
The overall compute shader execution model provides a great deal of flexibility for in-
stantiating a suitable number of threads to execute the desired processing kernel on the
elements of a resource. It is easy to map a complete resource to a given number of threads
and perform some computation on each of its data elements. With this execution model in
mind, we will now turn our attention to what can be done within the compute shader itself.
We will investigate some of the unique features of the compute shader memory model that
give developers even more flexibility in deciding how to implement an algorithm.
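As a simple illustration of the resource-to-thread mapping described above (the resource names and dimensions here are assumptions made for the sketch), a kernel can use its dispatch thread ID to address one element of a texture per thread:

Texture2D<float4>   InputMap  : register( t0 );
RWTexture2D<float4> OutputMap : register( u0 );

// One thread per texel; for a 640 x 480 texture the application would call
// Dispatch( 20, 15, 1 ) to launch enough 32 x 32 groups to cover the resource.
[numthreads( 32, 32, 1 )]
void CSMAIN( uint3 DispatchThreadID : SV_DispatchThreadID )
{
    OutputMap[DispatchThreadID.xy] = InputMap[DispatchThreadID.xy];
}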
5.3.1 Register-Based Memory
The compute shader runs on the same programmable processing hardware as the other programmable shader stages. This means that it is based on the same general processing paradigm and also implements the common shader core. It uses a register-based processing concept similar to that of the other pipeline stages, with the exception that the computation pipeline consists of only a single stage. The set of registers that the compute shader supports is quite similar to that of the other programmable stages, and includes input attribute registers (v#), texture registers (t#), constant buffer registers (cb#), unordered registers (u#), and temporary registers (r#, x#). Since all shader programming is performed