A thread is the processor state (such as the program counter and register contents) that is modified during the execution of a task. The idea of multithreading will not be new to you; it's related to ideas already discussed in Section 38.4. There we learned that the (architecturally specified) programmable vertex, primitive, and fragment processing stages are implemented with a single computation engine that is shared between these distinct tasks. Multithreading is the name of such a virtual-parallel implementation. When multiple tasks share a single processor, the thread of the currently executing task is stored (and modified) in the program counter and registers of the processor itself, while the threads of the remaining tasks are saved unchanged in thread store. Changing which task is executing on the processor involves swapping two threads: The thread of the currently active task is copied from the processor into thread store, and then the thread of the next-to-execute task is copied from thread store into the processor.
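As a rough sketch of this swap in C (the struct layout, the 48-entry store, and the function names are illustrative assumptions; real GPUs perform this in dedicated hardware, not software):

```c
/* Illustrative sketch of a thread swap. A "thread" here is just
 * saved processor state: the program counter and registers. */
typedef struct {
    unsigned pc;          /* program counter */
    unsigned regs[16];    /* register contents (count assumed) */
} Thread;

Thread thread_store[48]; /* inactive threads, held unchanged */
Thread active;           /* state living in the processor itself */

void swap_threads(int active_id, int next_id) {
    thread_store[active_id] = active;  /* save the active thread */
    active = thread_store[next_id];    /* load the next thread */
}
```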
Multithreading implementations are distinguished by their scheduling techniques. Scheduling determines two things: when to swap threads, and which inactive thread should become active. Interleaved scheduling cycles through threads in a regular sequence, allocating a fixed (though not necessarily equal) number of execution cycles to each thread in turn. Block scheduling executes the active thread until it cannot be advanced because it is waiting on an external dependency (such as a memory read operation) or an internal dependency (such as a multicycle ALU operation), and then swaps this thread for a thread that is runnable. Two queues of thread IDs are maintained: a queue of blocked threads and a queue of runnable threads. When a thread becomes blocked, its ID is appended to the blocked queue. IDs of threads that become unblocked are moved from the blocked queue to the run queue as their status changes.
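The bookkeeping for block scheduling might be sketched as follows. This is a simplification under assumed names; in particular, the blocked queue is modeled as a set, since dependencies resolve out of order:

```c
#include <stdbool.h>

#define MAX_THREADS 48

/* Run queue: a FIFO of runnable thread IDs (circular buffer). */
int  run_queue[MAX_THREADS];
int  run_head = 0, run_tail = 0;
bool is_blocked[MAX_THREADS];

void run_enqueue(int id) {
    run_queue[run_tail] = id;
    run_tail = (run_tail + 1) % MAX_THREADS;
}

int run_dequeue(void) {
    int id = run_queue[run_head];
    run_head = (run_head + 1) % MAX_THREADS;
    return id;
}

/* The active thread stalls on a dependency (memory read, multicycle
 * ALU op): mark it blocked and return the next runnable thread. */
int on_block(int active_id) {
    is_blocked[active_id] = true;
    return run_dequeue();
}

/* A dependency resolves: the thread moves back to the run queue. */
void on_unblock(int id) {
    is_blocked[id] = false;
    run_enqueue(id);
}
```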
GPUs such as the GeForce 9800 GTX implement multithreading with a hierarchical combination of interleaved and block scheduling. The GeForce 9800 GTX hardware enables zero-cycle replacement of blocked threads: No processor cycles are lost during swaps. Because there is no performance penalty for swapping threads, the GeForce 9800 GTX implements simple static load balancing by swapping every cycle, looping through the threads in the run queue. Load balancing between tasks of different types (vertex, primitive, and fragment) is implemented by including different proportions of these tasks in the mix of threads that is executed on a single core. This per-thread-group load balancing (the mix can't be changed during the execution of a group of such threads) is adjusted from thread group to thread group based on queue depths for the various task types.
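One plausible form of that queue-depth heuristic is sketched below; the proportional rule and all names here are illustrative assumptions, not NVIDIA's actual policy:

```c
/* Hypothetical per-thread-group load balancer: apportion the slots
 * in the next thread group among task types in proportion to the
 * depth of each type's pending-work queue. */
enum { VERTEX, PRIMITIVE, FRAGMENT, NUM_TYPES };

void choose_mix(const int queue_depth[NUM_TYPES], int slots,
                int mix[NUM_TYPES]) {
    int total = 0;
    for (int t = 0; t < NUM_TYPES; t++)
        total += queue_depth[t];
    if (total == 0) {                  /* nothing pending: split evenly */
        for (int t = 0; t < NUM_TYPES; t++)
            mix[t] = slots / NUM_TYPES;
        return;
    }
    int assigned = 0;
    for (int t = 0; t < NUM_TYPES; t++) {
        mix[t] = slots * queue_depth[t] / total;
        assigned += mix[t];
    }
    mix[FRAGMENT] += slots - assigned; /* hand rounding leftovers to one type */
}
```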
Threads are small by the standards of DRAM capacity: on the order of 2,000 bytes each (roughly 128 bytes per vector element). This suggests that thread stores would contain large numbers of threads, but in fact they do not. The GeForce 9800 GTX, for example, stores a maximum of 48 threads per processing core in expensive, on-chip memory. Once again, memory latency is the culprit. To support zero-cycle thread swaps, there must be near-immediate access to threads on the run queue. Thus, thread store must have low latency, and is necessarily small and local. (In fact, the GeForce 9800 GTX stores all threads in a single register file. Threads are not swapped in and out at all; instead, the addressing of the register file is offset based on which thread is being executed.) More generally, because multithreading compensates for DRAM latency, it is impractical for its implementation to experience (and be complicated by) that same latency. Thus, thread store is an expensive and scarce resource.
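A minimal sketch of that offset-addressing idea follows; the sizes and function names are assumptions for illustration:

```c
/* Sketch of shared-register-file addressing: nothing is copied on a
 * thread swap. Each resident thread owns a fixed slice of one large
 * register file, and the hardware adds the thread's base offset to
 * every register index an instruction encodes. */
#define REGS_PER_THREAD 32   /* assumed per-thread register count */
#define MAX_THREADS     48   /* threads resident per core */

unsigned register_file[MAX_THREADS * REGS_PER_THREAD];

/* "Swapping" to a thread merely changes the base offset in use. */
unsigned read_reg(int thread_id, int r) {
    return register_file[thread_id * REGS_PER_THREAD + r];
}

void write_reg(int thread_id, int r, unsigned value) {
    register_file[thread_id * REGS_PER_THREAD + r] = value;
}
```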