work, it is necessary to have some methodology that allows us to efficiently map a particular algorithm to run on many threads. The typical multithreading paradigm used in traditional CPU-based algorithms uses separate threads of execution, coupled with a shared memory space and manual synchronization. This is a fairly well-known model that has been in use for many years on systems with multiple processors.
However, this model is not optimal when mapped onto a processing paradigm that requires thousands of threads operating simultaneously. DirectCompute uses a different threading and execution model, which attempts to strike a balance between generality and ease of use. As we will see later in this section, this model allows threads to be mapped directly to data elements, providing a simple way to break a processing task into smaller pieces and run it on the GPU.
5.2.1 Kernel Processing
Like the other programmable shader stages, the compute shader implements a kernel-based processing system. The compute shader program itself is a function which, when executed, can be considered a processing kernel. This means that the shader program provides a kernel that processes one unit of work. That unit of work varies from algorithm to algorithm, but the currently loaded kernel is instantiated and applied to a single input set of data. In the case of the compute shader, the data set is provided through the resources bound to the compute shader stage.
This provides a very simple and intuitive way to program work for the thousands of threads that the GPU is capable of executing. Each thread is tasked with executing one invocation of the kernel on a particular data element. This simple concept reduces the complexity of designing algorithms for many threads by changing the design task from "What do I make each thread do?" to "Which piece of data does each thread process?" When the kernel is the same for all threads, the task becomes finding a data model that allows the desired data set to be broken into individual data elements that can be processed in isolation, rather than manually synchronizing the actions of all the threads. Instead of orchestrating the different responsibilities of each individual thread, the developer can focus on the best way to split a problem into many instances of the same problem, one per thread.
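As an illustration, a minimal compute shader kernel in HLSL might look like the following sketch. The buffer names, register bindings, thread-group size, and the doubling operation are all hypothetical, chosen only to show one thread processing one data element:

```hlsl
// Hypothetical input and output buffers bound to the compute shader stage.
StructuredBuffer<float>   InputData  : register(t0);
RWStructuredBuffer<float> OutputData : register(u0);

// Each thread group contains 64 threads; each thread runs one
// invocation of this kernel on one data element.
[numthreads(64, 1, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // The thread's dispatch ID selects which data element it processes,
    // so no manual synchronization between threads is required.
    OutputData[id.x] = InputData[id.x] * 2.0f;
}
```

Note how the kernel never refers to any other thread: the design effort goes entirely into choosing a data layout where each element can be handled in isolation.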
5.2.2 Dispatching Work
With an understanding of what the individual threads will execute, we can consider how the developer actually launches a batch of work with the desired number of threads. This is performed with the device context's ID3D11DeviceContext::Dispatch() method, or the ID3D11DeviceContext::DispatchIndirect() method.