Because thread store is a scarce resource whose capacity has a significant
effect on performance (the processor remains stalled while the run queue is empty),
optimizations that reduce thread storage requirements are vigorously pursued. We
discuss two such optimizations, both of which are implemented by the GeForce
9800 GTX.
First we consider thread size. Because register contents constitute a significant
fraction of a thread's state, the size of a thread can be meaningfully reduced
by saving and restoring only the contents of “active” registers. In principle, register
activity could be tracked throughout shader execution, so that thread-store
usage varied with the program counter of the blocked thread. In practice,
thread size is fixed throughout the execution of a shader, sized to accommodate the
maximum number of registers in use at any point during that execution.
Because the shader compiler is part of the GPU implementation (recall that shader
architecture is specified as a high-level language interface), the implementation not
only knows the peak register usage but can also influence it as a compilation
optimization.
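
A modern compiled-shader toolchain exposes exactly this influence. The sketch
below uses CUDA rather than a shading language because the mechanism is the
same there and easy to show; __launch_bounds__ and the nvcc flag --maxrregcount
are real CUDA features, but the kernel itself is a hypothetical example, not
anything specific to the GeForce 9800 GTX.

// A minimal CUDA sketch of compiler-directed register limiting. The
// __launch_bounds__ qualifier asks nvcc to cap per-thread register
// allocation so that at least the requested number of blocks can be
// resident on a multiprocessor at once; registers that do not fit
// are spilled to memory instead.
__global__ void
__launch_bounds__(256 /* maxThreadsPerBlock */, 4 /* minBlocksPerMultiprocessor */)
scale(const float *in, float *out, float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * in[i];  // trivial body; real shaders have far
                             // greater register pressure
}

// The same cap can be applied to a whole compilation unit:
//   nvcc --maxrregcount=32 shader.cu

Spilling trades extra memory traffic for smaller threads, and hence more of
them; whether that trade wins is exactly the question the next paragraph
takes up.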
Thread size matters because the finite thread store holds more small threads
than large ones, decreasing the chances of the run queue becoming empty and the
processor stalling. While this relationship is easily understood, programmers are
sometimes surprised when their attempt to optimize shader performance by minimizing
execution length (the number of instructions executed) reduces performance
rather than increasing it. Typically, (compiled) shader length and register
usage trade off against each other: shorter, heavily optimized programs
use more registers than longer, less optimized ones; hence the counterintuitive
tuning results. Modern GPU shader compilers include heuristics to optimize this
tradeoff, but even experienced coders are sometimes confounded.
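
A back-of-the-envelope calculation makes the tradeoff concrete. The
register-file capacity below is an assumed, illustrative figure, not a
published specification for any particular GPU.

#include <stdio.h>

// Illustrative occupancy arithmetic: a fixed-capacity thread store holds
// fewer large threads than small ones. Halving a shader's register
// footprint doubles the number of resident threads available to hide
// memory latency, even if the shader itself executes more instructions.
int main(void)
{
    const int threadStoreRegs = 8192;  // assumed capacity (illustrative)
    for (int regsPerThread = 16; regsPerThread <= 64; regsPerThread *= 2)
        printf("%2d regs/thread -> %4d resident threads\n",
               regsPerThread, threadStoreRegs / regsPerThread);
    return 0;
}

At 16 registers per thread this store holds 512 threads; at 64 it holds only
128, a fourfold reduction in latency-hiding capacity.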
Performance may also be optimized by running threads longer, thereby keeping
more threads in the run queue. A naive scheduler might immediately block a
thread on a memory read (or on an instruction such as tex1D that is known to read
data from memory) because it is rightly confident that the requested data will not
be available in the next cycle. But the requested data may not be required during
the next cycle: perhaps the thread will execute several instructions that do not
depend on the requested data before executing an instruction that does. A hardware
technique known as scoreboarding detects dependencies when they actually
occur, allowing thread execution to continue until a dependency is reached,
thereby avoiding stalls by keeping more threads in the run queue. It is good
programming practice to sequence source code so that dependencies are pushed
as far forward in the code as possible (that is, so that the first use of fetched
data comes as long after the fetch as possible), but shader compilers are
optimized to find such reordering opportunities regardless of the code's structure.
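
The following sketch, a hypothetical CUDA kernel devised for illustration,
shows the pattern: the read is issued first, several independent arithmetic
instructions follow, and only the last line consumes the fetched value, so a
scoreboarded core keeps the thread running well past the point where a naive
scheduler would have blocked it.

// Hypothetical kernel illustrating dependency distance. A scoreboarded
// core does not block this thread when the load issues; it blocks only
// at the first instruction that actually uses the loaded value.
__global__ void shade(const float *texels, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    float t = texels[i];     // memory read issued here ...

    float x = i * 0.5f;      // ... independent work continues
    float y = x * x + 1.0f;  // still no dependency on t
    float z = y - x;

    out[i] = t * z;          // first use of t: execution may block
                             // here, not at the read above
}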
While hiding memory latency with multithreaded processor cores is a defining
trait of modern GPUs, the practice has a long history in CPU organization. The
CRAY-1 did not use multithreading, but the CDC 6600, an early-1960s Seymour
Cray design that preceded the CRAY-1, did [Tho61]. It included a register barrel
that implemented a combination of pipeline parallelism and multithreading,
rotating through ten threads, each at a different stage in the execution of its
ten-clock instruction cycle. The Stellar GS 1000, a graphics supercomputer built
in the late 1980s, executed four threads in round-robin order on its vector-processing
main CPU, which also accelerated graphics operations [ABM88]. Most Intel
processors in the IA-32 family implement “hyperthreading,” Intel's branded version
of multithreaded execution.
 