consume structured buffer resource object. In this case, each thread would read one particle's data using the consume intrinsic function. Since the thread doesn't know which particle it is getting, it is completely independent of any other particle's data and can hence execute independently of the other threads. After the particle is updated, it can be added back into an append structured buffer with the append intrinsic function. There is no need to synchronize between threads, and hence the individual GPU processing elements can execute without managing any extra interthread communication. This type of particle system is explored further in Chapter 12, "Simulations."
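To make this pattern concrete, the following is a minimal sketch of such an update pass. The Particle structure, buffer names, and simulation constants used here are hypothetical placeholders, but the Consume() and Append() calls show how each thread acquires and returns exactly one element without knowing or caring which element any other thread is working on.

// Hypothetical particle record and buffer bindings used for illustration.
struct Particle
{
    float3 position;
    float3 velocity;
    float  time;
};

ConsumeStructuredBuffer<Particle> CurrentSimulationState : register( u0 );
AppendStructuredBuffer<Particle>  NewSimulationState     : register( u1 );

cbuffer SimulationParameters
{
    float DeltaTime;
};

[numthreads( 64, 1, 1 )]
void CSMAIN()
{
    // Consume one particle; which one is returned is undefined, so this
    // thread never depends on the data held by any other thread.
    Particle p = CurrentSimulationState.Consume();

    // Update the particle entirely from its own state.
    p.position += p.velocity * DeltaTime;
    p.time     += DeltaTime;

    // Append the updated particle to the output buffer for the next frame.
    NewSimulationState.Append( p );
}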
5.5 Algorithm Design
Throughout this chapter, we have learned about the various capabilities of the compute shader. Some of the concepts we have seen are quite similar to those of the other programmable shader stages, but others are quite different from anything that has come before. Indeed, a shader stage devoted to raw computation is a completely new addition in Direct3D 11. With the introduction of so much new functionality, it can be somewhat difficult to approach a completely new algorithm and decide which tools to use to implement it. This section aims to provide some general design guidelines that can be applied when developing an algorithm. Of course, there is no perfect methodology for designing an algorithm, so these guidelines should be taken as suggested starting points that can be built on for a particular scenario.
5.5.1 Parallelism
The first area we will consider is how to maximize the parallelism of an algorithm. The whole reason that the compute shader has been added to Direct3D 11 is to allow developers to harness all of the GPU's available parallel processing power. We have alluded to this throughout the chapter, but it should be an explicit design goal when developing an algorithm to run in the compute shader. The data should be organized in such a way that it can be processed with a minimal amount of memory access and computation, which will generally result in a faster algorithm. If the problem can be broken down into smaller, coherent parts, then the compute shader should be a good candidate for performing the calculations.
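As a simple illustration of this goal, the sketch below assigns exactly one data element to each thread, so no thread ever touches another thread's data. The buffer names, the element count constant, and the per-element operation are hypothetical; the host application would dispatch enough thread groups along the x-axis to cover the whole buffer, for example ceil( N / 256 ) groups for N elements.

// Hypothetical input and output buffers; each element is processed by
// exactly one thread, completely independently of all other threads.
StructuredBuffer<float>   InputData  : register( t0 );
RWStructuredBuffer<float> OutputData : register( u0 );

cbuffer DispatchParameters
{
    uint ElementCount;
};

[numthreads( 256, 1, 1 )]
void CSMAIN( uint3 DispatchThreadID : SV_DispatchThreadID )
{
    uint index = DispatchThreadID.x;

    // A purely element-local calculation: no shared memory, no atomics,
    // and no barriers are needed, so every processing element on the GPU
    // can run without any interthread coordination.
    if ( index < ElementCount )
    {
        OutputData[ index ] = sqrt( InputData[ index ] );
    }
}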
Minimize Synchronization
As discussed in the previous section, there are many ways to synchronize data between threads. Group shared memory, device resources, atomic functions, and memory barriers all provide different varieties of synchronization techniques. However, these synchronization mechanisms all impose some cost, either by forcing threads to wait for one another or by serializing access to memory, so their use should be kept to the minimum that the algorithm actually requires.
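The following minimal sketch, with hypothetical buffer and variable names, shows the kind of synchronization point this guideline refers to: the threads in a group exchange data through group shared memory, and the barrier forces every thread to wait until the whole group has finished writing before any thread reads.

// Hypothetical buffer holding one value per thread.
RWStructuredBuffer<float> DataBuffer : register( u0 );

#define GROUP_SIZE 64

groupshared float SharedValues[ GROUP_SIZE ];

[numthreads( GROUP_SIZE, 1, 1 )]
void CSMAIN( uint3 GroupThreadID    : SV_GroupThreadID,
             uint3 DispatchThreadID : SV_DispatchThreadID )
{
    // Each thread stores its value into group shared memory.
    SharedValues[ GroupThreadID.x ] = DataBuffer[ DispatchThreadID.x ];

    // Every thread in the group must reach this barrier before any thread
    // proceeds; this waiting is exactly the cost that should be minimized.
    GroupMemoryBarrierWithGroupSync();

    // Reading a neighboring thread's value is only safe after the barrier.
    uint neighbor = ( GroupThreadID.x + 1 ) % GROUP_SIZE;
    DataBuffer[ DispatchThreadID.x ] += SharedValues[ neighbor ];
}

When the same result can be achieved without such a barrier, for example by restructuring the data so that each thread only ever touches its own elements, the barrier-free version will usually scale better across the GPU's processing elements.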