Because thread store is a scarce resource whose capacity has a significant
effect on performance (the processor remains stalled while the run queue is empty),
optimizations that reduce thread storage requirements are vigorously pursued. We
discuss two such optimizations, both of which are implemented by the GeForce
9800 GTX.
First we consider thread size. Because register contents constitute a significant
fraction of a thread's state, the size of a thread can be meaningfully reduced
by saving and restoring only the contents of “active” registers. In principle, register
activity could be tracked throughout shader execution, so that thread-store
usage varied with the program counter of the blocked thread. In practice,
thread size is fixed throughout the execution of a shader, sized to accommodate the
maximum number of registers in use at any point during that execution.
Because the shader compiler is part of the GPU implementation (recall that shader
architecture is specified as a high-level language interface), the implementation not
only knows the peak register usage but can also influence it as a compilation
optimization.
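
A modern compiled-shader toolchain exposes exactly this influence. The sketch
below uses CUDA rather than a shading language because the mechanism is the
same there and easy to show; __launch_bounds__ and the nvcc flag --maxrregcount
are real CUDA features, but the kernel itself is a hypothetical example, not
anything specific to the GeForce 9800 GTX.

// A minimal CUDA sketch of compiler-directed register limiting. The
// __launch_bounds__ qualifier asks nvcc to cap per-thread register
// allocation so that at least the requested number of blocks can be
// resident on a multiprocessor at once; registers that do not fit
// are spilled to memory instead.
__global__ void
__launch_bounds__(256 /* maxThreadsPerBlock */, 4 /* minBlocksPerMultiprocessor */)
scale(const float *in, float *out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * in[i];  // trivial body; real shaders have far
                             // greater register pressure
}

// The same cap can be applied to a whole compilation unit:
//   nvcc --maxrregcount=32 shader.cu

Spilling trades extra memory traffic for smaller threads, and hence more of
them; whether that trade wins is exactly the question the next paragraph
takes up.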
Thread size matters because the finite thread store holds more small threads
than large ones, decreasing the chances of the run queue becoming empty and the
processor stalling. While this relationship is easily understood, programmers are
sometimes surprised when their attempt to optimize shader performance by minimizing
execution length (the number of instructions executed) reduces performance
rather than increasing it. Typically, (compiled) shader length and register
usage trade off against each other: shorter, heavily optimized programs
use more registers than longer, less optimized ones; hence the counterintuitive
tuning results. Modern GPU shader compilers include heuristics to optimize this
tradeoff, but even experienced coders are sometimes confounded.
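
A back-of-the-envelope calculation makes the tradeoff concrete. The
register-file capacity below is an assumed, illustrative figure, not a
published specification for any particular GPU.

#include <stdio.h>

// Illustrative occupancy arithmetic: a fixed-capacity thread store holds
// fewer large threads than small ones. Halving a shader's register
// footprint doubles the number of resident threads available to hide
// memory latency, even if the shader itself executes more instructions.
int main(void)
{
    const int threadStoreRegs = 8192;  // assumed capacity (illustrative)
    for (int regsPerThread = 16; regsPerThread <= 64; regsPerThread *= 2)
        printf("%2d regs/thread -> %4d resident threads\n",
               regsPerThread, threadStoreRegs / regsPerThread);
    return 0;
}

At 16 registers per thread this store holds 512 threads; at 64 it holds only
128, a fourfold reduction in latency-hiding capacity.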
Performance may also be optimized by running threads longer, thereby keeping
more threads in the run queue. A naive scheduler might immediately block a
thread on a memory read (or on an instruction such as tex1D that is known to read
data from memory) because it is rightly confident that the requested data will not
be available in the next cycle. But the requested data may not be required during
the next cycle: perhaps the thread will execute several instructions that do not
depend on the requested data before executing an instruction that does. A hardware
technique known as scoreboarding detects dependencies when they actually
occur, allowing thread execution to continue until a dependency is reached,
thereby avoiding stalls by keeping more threads in the run queue. It is good
programming practice to sequence source code so that dependencies are pushed
as far forward in the code as possible (that is, so that the first use of fetched
data comes as long after the fetch as possible), but shader compilers are
optimized to find such reordering opportunities regardless of the code's structure.
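
The following sketch, a hypothetical CUDA kernel devised for illustration,
shows the pattern: the read is issued first, several independent arithmetic
instructions follow, and only the last line consumes the fetched value, so a
scoreboarded core keeps the thread running well past the point where a naive
scheduler would have blocked it.

// Hypothetical kernel illustrating dependency distance. A scoreboarded
// core does not block this thread when the load issues; it blocks only
// at the first instruction that actually uses the loaded value.
__global__ void shade(const float *texels, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float t = texels[i];     // memory read issued here ...

    float x = i * 0.5f;      // ... independent work continues
    float y = x * x + 1.0f;  // still no dependency on t
    float z = y - x;

    out[i] = t * z;          // first use of t: execution may block
                             // here, not at the read above
}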
While hiding memory latency with multithreaded processor cores is a defining
trait of modern GPUs, the practice has a long history in CPU organization. The
CRAY-1 did not use multithreading, but the CDC 6600, an early-1960s Seymour
Cray design that preceded the CRAY-1, did [Tho61]. It included a register barrel
that implemented a combination of pipeline parallelism and multithreading,
rotating through ten threads, each at a different stage in the execution of its
ten-clock instruction cycle. The Stellar GS 1000, a graphics supercomputer built
in the late 1980s, executed four threads in round-robin order on its vector-processing
main CPU, which also accelerated graphics operations [ABM88]. Most Intel
processors in the IA-32 family implement “hyperthreading,” Intel's branded version
of multithreaded execution.
 