A thread is the processor state (such as the program counter and register contents) that is modified during the execution of a task. The idea of multithreading will not be new to you; it's related to ideas already discussed in Section 38.4. There we learned that the (architecturally specified) programmable vertex, primitive, and fragment processing stages are implemented with a single computation engine that is shared between these distinct tasks. Multithreading is the name of such a virtual-parallel implementation. When multiple tasks share a single processor, the thread of the currently executing task is stored (and modified) in the program counter and registers of the processor itself, while the threads of the remaining tasks are saved unchanged in thread store. Changing which task is executing on the processor involves swapping two threads: The thread of the currently active task is copied from the processor into thread store, and then the thread of the next-to-execute task is copied from thread store into the processor.
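As a rough sketch of this swap in C (the struct layout, the 48-entry store, and the function names are illustrative assumptions; real GPUs perform this in dedicated hardware, not software):

```c
/* Illustrative sketch of a thread swap. A "thread" here is just
 * saved processor state: the program counter and registers. */
typedef struct {
    unsigned pc;          /* program counter */
    unsigned regs[16];    /* register contents (count assumed) */
} Thread;

Thread thread_store[48]; /* inactive threads, held unchanged */
Thread active;           /* state living in the processor itself */

void swap_threads(int active_id, int next_id) {
    thread_store[active_id] = active;  /* save the active thread */
    active = thread_store[next_id];    /* load the next thread */
}
```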
Multithreading implementations are distinguished by their scheduling techniques. Scheduling determines two things: when to swap threads, and which inactive thread should become active. Interleaved scheduling cycles through threads in a regular sequence, allocating a fixed (though not necessarily equal) number of execution cycles to each thread in turn. Block scheduling executes the active thread until it cannot be advanced because it is waiting on an external dependency (such as a memory read operation) or an internal dependency (such as a multicycle ALU operation), and then swaps this thread for a thread that is runnable. Two queues of thread IDs are maintained: a queue of blocked threads and a queue of runnable threads. When a thread becomes blocked, its ID is appended to the blocked queue. IDs of threads that become unblocked are moved from the blocked queue to the run queue as their status changes.
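The bookkeeping for block scheduling might be sketched as follows. This is a simplification under assumed names; in particular, the blocked queue is modeled as a set, since dependencies resolve out of order:

```c
#include <stdbool.h>

#define MAX_THREADS 48

/* Run queue: a FIFO of runnable thread IDs (circular buffer). */
int  run_queue[MAX_THREADS];
int  run_head = 0, run_tail = 0;
bool is_blocked[MAX_THREADS];

void run_enqueue(int id) {
    run_queue[run_tail] = id;
    run_tail = (run_tail + 1) % MAX_THREADS;
}

int run_dequeue(void) {
    int id = run_queue[run_head];
    run_head = (run_head + 1) % MAX_THREADS;
    return id;
}

/* The active thread stalls on a dependency (memory read, multicycle
 * ALU op): mark it blocked and return the next runnable thread. */
int on_block(int active_id) {
    is_blocked[active_id] = true;
    return run_dequeue();
}

/* A dependency resolves: the thread moves back to the run queue. */
void on_unblock(int id) {
    is_blocked[id] = false;
    run_enqueue(id);
}
```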
GPUs such as the GeForce 9800 GTX implement multithreading with a hierarchical combination of interleaved and block scheduling. The GeForce 9800 GTX hardware enables zero-cycle replacement of blocked threads: No processor cycles are lost during swaps. Because there is no performance penalty for swapping threads, the GeForce 9800 GTX implements simple static load balancing by swapping every cycle, looping through the threads in the run queue. Load balancing between tasks of different types (vertex, primitive, and fragment) is implemented by including different proportions of these tasks in the mix of threads that is executed on a single core. This per-thread-group load balancing (the mix can't be changed during the execution of a group of such threads) is adjusted from thread group to thread group based on queue depths for the various task types.
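One plausible form of that queue-depth heuristic is sketched below; the proportional rule and all names here are illustrative assumptions, not NVIDIA's actual policy:

```c
/* Hypothetical per-thread-group load balancer: apportion the slots
 * in the next thread group among task types in proportion to the
 * depth of each type's pending-work queue. */
enum { VERTEX, PRIMITIVE, FRAGMENT, NUM_TYPES };

void choose_mix(const int queue_depth[NUM_TYPES], int slots,
                int mix[NUM_TYPES]) {
    int total = 0;
    for (int t = 0; t < NUM_TYPES; t++)
        total += queue_depth[t];
    if (total == 0) {                  /* nothing pending: split evenly */
        for (int t = 0; t < NUM_TYPES; t++)
            mix[t] = slots / NUM_TYPES;
        return;
    }
    int assigned = 0;
    for (int t = 0; t < NUM_TYPES; t++) {
        mix[t] = slots * queue_depth[t] / total;
        assigned += mix[t];
    }
    mix[FRAGMENT] += slots - assigned; /* hand rounding leftovers to one type */
}
```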
Threads are small by the standards of DRAM capacity: on the order of 2,000 bytes each (roughly 128 bytes per vector element). This suggests that thread stores would contain large numbers of threads, but in fact they do not. The GeForce 9800 GTX, for example, stores a maximum of 48 threads per processing core in expensive, on-chip memory. Once again, memory latency is the culprit. To support zero-cycle thread swaps, there must be near-immediate access to threads on the run queue. Thus, thread store must have low latency, and is necessarily small and local. (In fact, the GeForce 9800 GTX stores all threads in a single register file. Threads are not swapped in and out at all; instead, the addressing of the register file is offset based on which thread is being executed.) More generally, because multithreading compensates for DRAM latency, it is impractical for its implementation to experience (and be complicated by) that same latency. Thus, thread store is an expensive and scarce resource.
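A minimal sketch of that offset-addressing idea follows; the sizes and function names are assumptions for illustration:

```c
/* Sketch of shared-register-file addressing: nothing is copied on a
 * thread swap. Each resident thread owns a fixed slice of one large
 * register file, and the hardware adds the thread's base offset to
 * every register index an instruction encodes. */
#define REGS_PER_THREAD 32   /* assumed per-thread register count */
#define MAX_THREADS     48   /* threads resident per core */

unsigned register_file[MAX_THREADS * REGS_PER_THREAD];

/* "Swapping" to a thread merely changes the base offset in use. */
unsigned read_reg(int thread_id, int r) {
    return register_file[thread_id * REGS_PER_THREAD + r];
}

void write_reg(int thread_id, int r, unsigned value) {
    register_file[thread_id * REGS_PER_THREAD + r] = value;
}
```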