Read requests that are fulfilled by cached data (cache hits) have dramatically lower latency than those fulfilled from main memory (cache misses). Taking main-memory latency as the benchmark, this disparity is desirable: If most memory requests hit, latency is dramatically reduced. But it is tempting instead to take the cache's hit latency as the benchmark, because this performance is achieved asymptotically as the cache-miss rate goes to zero. Unfortunately, the large disparity in latencies is undesirable from this viewpoint, because even a few misses dramatically increase average latency. For example, the average latency of a cache with a miss penalty of 100× is doubled by a miss rate of only 1%. In practice, only very large cache memories achieve average read latencies that approach this hit latency.
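A few lines of C make this arithmetic concrete. This is only a sketch of the standard average-latency formula; the 1-cycle hit latency and 100-cycle miss penalty are assumed values chosen to match the 100× example above.

    #include <stdio.h>

    /* Average read latency = hit latency + miss rate * miss penalty.
     * The unit latencies below are illustrative assumptions. */
    int main(void) {
        const double hit_latency  = 1.0;    /* cycles, assumed                */
        const double miss_penalty = 100.0;  /* extra cycles per miss, assumed */

        for (int pct = 0; pct <= 5; ++pct) {
            double miss_rate = pct / 100.0;
            double avg = hit_latency + miss_rate * miss_penalty;
            printf("miss rate %d%%  ->  average latency %.2f cycles\n", pct, avg);
        }
        return 0;
    }

At a 1% miss rate the program prints an average latency of 2.00 cycles, twice the hit latency, confirming the doubling claimed above.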
Cache memory is organized into equal-size units called lines, which are typically much larger than a single data item. Transfers between the processor and cache memory operate at the granularity of individual data items—a word is read from a line in the cache and returned to the processor, or a byte is written from the processor into the appropriate cache line. But transfers between cache and main memory operate at cache-line granularity—entire cache lines are either read from or written to main memory. Cache-line size is chosen so that these transfers make efficient use of main-memory bandwidth. For example, cache lines may be as large as the blocks in main memory, or at least a substantial fraction of this size. When a cache read miss forces a line to be loaded from main memory, spatial locality ensures that most if not all of the data items in that line will be accessed before the line is overwritten by another. And caches can be designed to transfer lines back to main memory infrequently (write-back cache) rather than immediately after the processor writes a data item to the cache (write-through cache), minimizing the main-memory bandwidth consumed by writing, and thereby maximizing the main-memory bandwidth available for reading.
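The following C sketch illustrates these two transfer granularities and the write-back policy, assuming a direct-mapped cache with 64-byte lines and a flat 1 MiB main memory (all sizes are illustrative, not taken from any particular processor). Processor-side accesses touch single bytes, all main-memory traffic moves whole lines, and the dirty bit defers write-back until eviction.

    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES 64   /* assumed cache-line size */
    #define NUM_LINES  128  /* assumed number of lines */

    typedef struct {
        uint32_t tag;
        int      valid;
        int      dirty;                /* set on write; forces write-back */
        uint8_t  data[LINE_BYTES];
    } CacheLine;

    static CacheLine cache[NUM_LINES];
    static uint8_t   main_memory[1 << 20];  /* assumed 1 MiB backing store */

    /* Bring the line containing addr into the cache, writing the evicted
     * line back to main memory only if it is dirty (write-back policy). */
    static CacheLine *fetch_line(uint32_t addr) {
        uint32_t line_addr = addr / LINE_BYTES;
        uint32_t index     = line_addr % NUM_LINES;
        uint32_t tag       = line_addr / NUM_LINES;
        CacheLine *line    = &cache[index];

        if (!line->valid || line->tag != tag) {   /* miss */
            if (line->valid && line->dirty) {     /* write back evicted line */
                uint32_t old = (line->tag * NUM_LINES + index) * LINE_BYTES;
                memcpy(&main_memory[old], line->data, LINE_BYTES);
            }
            /* Whole-line transfer from main memory, never a single byte. */
            memcpy(line->data, &main_memory[line_addr * LINE_BYTES], LINE_BYTES);
            line->tag   = tag;
            line->valid = 1;
            line->dirty = 0;
        }
        return line;
    }

    /* Processor-side transfers operate on individual data items. */
    uint8_t read_byte(uint32_t addr) {
        return fetch_line(addr)->data[addr % LINE_BYTES];
    }

    void write_byte(uint32_t addr, uint8_t value) {
        CacheLine *line = fetch_line(addr);
        line->data[addr % LINE_BYTES] = value;
        line->dirty = 1;  /* defer the main-memory update until eviction */
    }

    int main(void) {
        write_byte(0, 42);                       /* dirties cache line 0      */
        write_byte(LINE_BYTES * NUM_LINES, 7);   /* evicts it: forced write-back */
        return main_memory[0] == 42 ? 0 : 1;     /* 42 reached main memory    */
    }

A write-through cache would instead copy each write to main memory immediately; the dirty bit is what lets many processor writes to the same line be folded into a single line transfer. Real caches are set-associative and service many requests concurrently, which this sketch omits.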
From the standpoint of the processor, cache memory addresses both of the key
concerns of main memory: Apparent memory latency is reduced, and apparent
memory bandwidth is increased. If cache memory size could be made arbitrarily
large, both apparent latency and apparent bandwidth could in principle be driven
to the point of diminishing return (i.e., to the point where further improvement
would not increase processor performance). In practice, cache size is limited to a small fraction of the size of main memory; once the working set exceeds what the cache can hold, performance slows to that of main memory. Because apparent latency increases quickly even for very
low miss rates, GPU implementations are typically tuned to achieve performance
that is unconstrained by memory bandwidth (assuming typical graphics loading)
with caches that are far too small to ensure the required latency. The (otherwise
unacceptable) apparent memory latency is hidden by multithreading, as described
in Section 38.6.3, rather than by outsized cache memories.
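A back-of-envelope calculation shows why multithreading can stand in for a large cache. By Little's law, the number of memory requests that must be in flight to keep the memory system fully utilized equals latency times sustained request rate; the 400-cycle latency and one-request-per-cycle issue rate below are assumptions chosen only for illustration.

    #include <stdio.h>

    /* Concurrency needed to hide memory latency (Little's law):
     * requests in flight = latency x sustained request rate.
     * Both figures below are illustrative assumptions. */
    int main(void) {
        const double latency_cycles     = 400.0;  /* assumed memory latency       */
        const double requests_per_cycle = 1.0;    /* assumed sustained issue rate */

        double in_flight = latency_cycles * requests_per_cycle;
        printf("requests that must be in flight: %.0f\n", in_flight);
        return 0;
    }

If each thread sustains only one outstanding request, roughly 400 threads must be resident to cover the assumed latency, which is one reason GPUs keep very large numbers of threads in flight rather than growing their caches.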
It is still possible for shader programmers to get in trouble, though, by demanding more memory bandwidth than is available. For example, GPU texture interpolation performance is typically balanced assuming high data locality. If this assumption is disrupted—if, for example, texture sample addresses specify disjoint, widely separated clusters of texels—then an excessively large number of memory blocks may be transferred from main memory to cache memory, and shader performance can plummet (the sketch at the end of this section makes the effect concrete). Undersampling a texture is one way to create this situation. Thus, texture aliasing not only destroys image quality, it can also destroy GPU performance! Dependent texture reads, meaning calls to tex1D, tex2D, or tex3D, with a parameter that is not directly derived from the
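To make the scattered-sampling scenario above concrete, the following C sketch counts how many (assumed) 64-byte cache lines a batch of texel fetches touches; the 1K×1K texture, 4-byte texels, and the particular scattered stride are all illustrative assumptions.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    #define LINE_BYTES  64    /* assumed cache-line size */
    #define TEXEL_BYTES 4     /* assumed bytes per texel */
    #define TEX_SIZE    1024  /* assumed 1K x 1K texture */
    #define NUM_LINES   (TEX_SIZE * TEX_SIZE * TEXEL_BYTES / LINE_BYTES)

    static uint8_t touched[NUM_LINES];

    /* Count the distinct cache lines touched by n texel fetches. */
    static int count_lines(const uint32_t *texel, int n) {
        int lines = 0;
        memset(touched, 0, sizeof touched);
        for (int i = 0; i < n; ++i) {
            uint32_t line = texel[i] * TEXEL_BYTES / LINE_BYTES;
            if (!touched[line]) { touched[line] = 1; ++lines; }
        }
        return lines;
    }

    int main(void) {
        enum { N = 256 };
        uint32_t coherent[N], scattered[N];
        for (int i = 0; i < N; ++i) {
            coherent[i]  = (uint32_t)i;           /* adjacent texels         */
            scattered[i] = (uint32_t)(i * 4099);  /* widely separated texels */
        }
        printf("coherent : %d lines for %d fetches\n", count_lines(coherent,  N), N);
        printf("scattered: %d lines for %d fetches\n", count_lines(scattered, N), N);
        return 0;
    }

The coherent pattern touches 16 lines for 256 fetches, while the scattered pattern touches 256: a 16× increase in line traffic for exactly the same number of samples. This is the bandwidth amplification that can make an undersampled texture so expensive.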