work, it is necessary to have some methodology that allows us to efficiently map a particular algorithm to run on many threads. The typical multithreading paradigm used in traditional CPU-based algorithms uses separate threads of execution, coupled with a shared memory space and manual synchronization. This is a fairly well-known model that has been in use for many years on systems with multiple processors.
However, this model is not optimal when mapped onto a processing paradigm that requires thousands of threads operating simultaneously. DirectCompute uses a different threading and execution model, which attempts to strike a balance between generality and ease of use. As we will see later in this section, this model allows threads to be mapped directly to data elements, providing a simple way to break a processing task into smaller pieces and run it on the GPU.
5.2.1 Kernel Processing
Like the other programmable shader stages, the compute shader implements a kernel-based processing system. The compute shader program itself is a function which, when executed, can be considered a processing kernel. This means that the shader program provides a kernel that processes one unit of work. That unit of work varies from algorithm to algorithm, but the currently loaded kernel is instantiated and applied to a single input set of data. In the case of the compute shader, the data set is provided through the resources bound to the compute shader stage.
This provides a very simple and intuitive way to program work for the thousands of threads that the GPU is capable of executing. Each thread is tasked with executing one invocation of the kernel on a particular data element. This simple concept reduces the complexity of designing algorithms for many threads by changing the design task from "What do I make each thread do?" to "Which piece of data does each thread process?" When the kernel is the same for all threads, the task becomes finding a data model that allows the desired data set to be broken into individual data elements that can be processed in isolation, rather than manually synchronizing the actions of all the threads. Instead of orchestrating the different responsibilities of each individual thread, the developer can focus on the best way to split a problem into many instances of the same problem, one per thread.
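As an illustration, a minimal compute shader kernel in HLSL might look like the following sketch. The buffer names, register bindings, thread-group size, and the doubling operation are all hypothetical, chosen only to show one thread processing one data element:

```hlsl
// Hypothetical input and output buffers bound to the compute shader stage.
StructuredBuffer<float>   InputData  : register(t0);
RWStructuredBuffer<float> OutputData : register(u0);

// Each thread group contains 64 threads; each thread runs one
// invocation of this kernel on one data element.
[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // The thread's dispatch ID selects which data element it processes,
    // so no manual synchronization between threads is required.
    OutputData[id.x] = InputData[id.x] * 2.0f;
}
```

Note how the kernel never refers to any other thread: the design effort goes entirely into choosing a data layout where each element can be handled in isolation.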
5.2.2 Dispatching Work
With an understanding of what the individual threads will execute, we can consider how the developer actually launches a batch of work with the desired number of threads. This is performed with the device context's ID3D11DeviceContext::Dispatch() method, or the ID3D11DeviceContext::DispatchIndirect() method.