THE MEMORY PRINCIPLE: The primary challenge of memory is coping with
access latency and limited bandwidth. Capacity is a secondary concern.
The way that GPUs such as the GeForce 9800 GTX handle this (sometimes)
huge memory latency is the subject of the next subsection.
38.6.3 Coping with Latency
Consider again the fragment shader of Listing 38.2. Ignoring execution of the
tex1D instruction, this shader makes five floating-point assignments and performs
four floating-point multiplications, for a total of nine operations. Again, ignor-
ing texture interpolation, we would expect its execution to require approximately
ten clock cycles, perhaps fewer if the GPU's data path implementation supports
hardware parallelism for short vector operations. (The GeForce 9800 GTX data
path does not.) Unfortunately, even if the GPU implementation provides separate
hardware for the computations required by image interpolation (the GeForce 9800
GTX, like most modern GPUs, does), the execution time of the texture-based
fragment shader could increase to hundreds or even thousands of cycles due to the
latency of the memory reads required to gather the values of the texels. Thus, the
performance of a naive implementation of this shader could be reduced by a factor
of ten to one hundred, or more, due to memory latency.
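A back-of-the-envelope model makes the scale of the problem concrete. The operation counts come from the text; the 400-cycle DRAM latency is an assumed, representative figure (real values vary by part and workload), and the four texel fetches assume bilinear interpolation:

```python
# Rough model of how memory latency dominates a short fragment shader.
# The shader performs ~9 floating-point operations (5 assignments +
# 4 multiplications), roughly 10 cycles on a scalar data path.
compute_cycles = 10

# Assumed representative DRAM access latency, in GPU clock cycles.
dram_latency_cycles = 400

# Bilinear interpolation reads four texels per texture access.
texel_fetches = 4

# Worst case: every fetch misses and the fetches are fully serialized.
worst_case = compute_cycles + texel_fetches * dram_latency_cycles

# Slowdown relative to the compute-only estimate.
slowdown = worst_case / compute_cycles
print(f"worst-case cycles: {worst_case}, slowdown: {slowdown:.0f}x")
```

Even if a real memory system overlaps the four fetches, a single serialized miss already multiplies the shader's execution time many times over, which is why the text's "factor of ten to one hundred, or more" is plausible.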
There are three potentially legitimate responses to this situation: 1) accept it;
2) take further steps to reduce memory latency; or 3) arrange for the system to do
something else while waiting on memory. Of course, options (2) and (3) may be
combined.
The engineering option of accepting a nonoptimal situation must always be
considered. Just as code optimization is best directed by thorough performance
profiling, hardware optimization is justified only by a significant improvement in
dynamic (i.e., real-world) performance. If texture interpolation were extremely
rare, its low performance would have little real-world effect. In fact, texture inter-
polation is ubiquitous in modern shaders, so its optimization is of paramount con-
cern to GPU implementors. Something must be done.
Once the memory controller has been optimized, further reduction of memory
latency may be achieved by caching. Briefly, caching augments a homogeneous,
large (and therefore distant and high-latency) memory with a hierarchy of var-
iously sized memories, the smallest placed nearest the requesting circuitry, and
the largest most distant from it. All modern GPUs implement caching for tex-
ture image interpolation. However, unlike CPUs such as the Intel Core 2 Extreme
QX9770, which depend primarily on their large, multilevel cache systems for
both memory bandwidth and latency optimization, GPU caches are large enough
to ensure that available memory bandwidth is utilized efficiently, but they are too
small to adequately reduce memory latency. We leave further discussion of the
important topic of caching to Section 38.7.2.
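The asymmetry between CPU and GPU caches can be seen in the standard average-memory-access-time formula, AMAT = hit time + miss rate × miss penalty. The numbers below are illustrative assumptions, not measurements of any particular part:

```python
# Average memory access time (AMAT) for a single-level cache model.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# Small GPU texture cache (assumed figures): misses remain common, so
# average latency stays far above the hit time. The cache still pays for
# itself in bandwidth, since each texel is fetched from DRAM once and
# then reused by neighboring fragments.
gpu_avg = amat(hit_time=4, miss_rate=0.10, miss_penalty=400)

# Large multilevel CPU cache (assumed figures): the miss rate is driven
# low enough that average latency approaches the hit time.
cpu_avg = amat(hit_time=4, miss_rate=0.01, miss_penalty=400)

print(f"GPU-like cache: {gpu_avg} cycles; CPU-like cache: {cpu_avg} cycles")
```

Under these assumed numbers the small cache leaves average latency an order of magnitude above the hit time, which is the sense in which GPU caches optimize bandwidth but do not adequately reduce latency.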
Because options (1) and (2) do not adequately address latency concerns, the
performance of GPUs such as the GeForce 9800 GTX depends heavily on their
implementation of option (3): arranging for the GPU to do something else while
waiting on memory. The technique they employ is called multithreading. A
thread is the dynamic, nonmemory state (such as the program counter and register