triangle-interpolated texture coordinates, can also disrupt or destroy locality. The
resultant complications contributed to the delay of support for dependent texture
lookup, which was introduced to GPUs years after shader programmability was
first supported.
Thus far we have considered a single cache memory, but modern GPUs often
have two levels of cache, and CPUs even more (three, sometimes four). By con-
vention these are named L1 cache, L2 cache, . . . , Ln cache, counting outward from
the processor toward main memory.^15 L1 cache has the least capacity, but also the
least latency, and is optimized to interact well with the processor. Ln cache has
the greatest capacity and the highest latency, and is optimized to interact well
with main memory. Worst-case latency may actually increase as levels of memory
hierarchy are added, due to the summation of multiple miss penalties, but overall
performance is improved.
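This tradeoff can be made concrete with the standard average-memory-access-time (AMAT) model: each level's miss rate gates the expected cost of going one level further out. The latencies and miss rates below are illustrative placeholders, not measurements from any particular GPU.

```python
# Average memory access time (AMAT) for a two-level cache hierarchy.
# All cycle counts and miss rates are hypothetical, chosen only to
# illustrate the shape of the tradeoff.
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    # A miss at each level pays that level's hit time plus the
    # expected cost of the next level out.
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_latency)

# Single cache: 4-cycle hit, 10% miss rate, 200-cycle main memory.
single_level = 4 + 0.10 * 200                 # 24.0 cycles on average

# Add an L2: 20-cycle hit, and 30% of L1 misses also miss in L2.
two_level = amat(4, 0.10, 20, 0.30, 200)      # 12.0 cycles on average

# Worst case (miss at every level) is higher with the extra level,
# illustrating the summation of miss penalties mentioned above:
worst_single = 4 + 200                        # 204 cycles
worst_two = 4 + 20 + 200                      # 224 cycles
```

Average latency halves in this example even though the worst-case path lengthens, which is exactly the behavior described in the text.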
In systems with multiple processor cores, caching is typically also parallelized.
The GeForce 9800 GTX GPU, for example, implements a separate L1 cache for
each pair of cores, and a separate L2 cache for each bank of memory (see Fig-
ure 38.4). Multiple L1 caches allow each to be tightly coupled with only two
processor cores, reducing latency by improving locality (each cache is physi-
cally closer to its cores) and by reducing access conflicts (each cache receives
requests from fewer cores). Pairing an L2 cache with each memory bank allows
each cache to aggregate accesses that map to its portion of main memory. Explicit
local memory is also parallelized—the GeForce 9800 GTX implements a separate
local memory for each core.
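The pairing of an L2 cache with each memory bank amounts to a fixed mapping from physical address to bank. The sketch below interleaves cache-line-sized blocks across banks; the bank count and line size are assumptions for illustration, not the GeForce 9800 GTX's actual parameters.

```python
# Sketch: mapping a physical address to the L2 cache paired with its
# memory bank. NUM_BANKS and LINE_BYTES are illustrative assumptions.
NUM_BANKS = 4
LINE_BYTES = 64  # interleave at cache-line granularity

def l2_bank(address):
    # Consecutive cache lines rotate through the banks, so streaming
    # accesses spread evenly across all L2 caches.
    return (address // LINE_BYTES) % NUM_BANKS

# Four consecutive lines land in four different banks:
banks = [l2_bank(a) for a in range(0, 256, 64)]   # [0, 1, 2, 3]
```

Because the mapping is a pure function of the address, every request for a given location is handled by the same L2 cache, which is what lets each cache safely aggregate accesses to its portion of main memory.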
Recall that a key advantage of implicit local memory is the simplicity and
reliability of its programming model. Adding hierarchy does not compromise this
model: Although a single physical memory location may now be cached at mul-
tiple levels of the memory hierarchy, memory requests from multiple cores “see”
a consistent value because the requests are handled consistently. Parallel caches
(such as multiple L1 caches) potentially break the model, however, because they
are accessed and updated independently, so replications of a single physical mem-
ory location can become inconsistent. If this happens, the programmer's model of
the memory system has become incoherent, and the likelihood of programming
errors skyrockets.
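A minimal simulation makes the hazard concrete. The cache model below is a deliberate toy (write-back, no coherence traffic between caches); the address and values are arbitrary.

```python
# Sketch: two independent L1 caches with no coherence protocol can hold
# inconsistent replicas of the same memory location.
memory = {0x100: 1}

class L1Cache:
    def __init__(self):
        self.lines = {}
    def read(self, addr):
        if addr not in self.lines:       # miss: fill from main memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]          # hit: return the cached replica
    def write(self, addr, value):
        self.lines[addr] = value         # write-back: memory not yet updated

core0, core1 = L1Cache(), L1Cache()
core0.read(0x100)          # both cores cache the location (value 1)
core1.read(0x100)
core0.write(0x100, 42)     # core 0 updates only its own replica
stale = core1.read(0x100)  # core 1 still sees 1: the replicas have diverged
```

After the write, core 0 observes 42 while core 1 observes 1, and main memory still holds 1: three inconsistent views of one location, which is precisely the incoherence the text warns about.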
Architects of parallel systems handle the cache coherence problem in one of
three ways.
1. Coherent memory: A coherent view of memory is enforced by adding
complexity to the memory hierarchy implementation. Cache-coherent pro-
tocols ensure that changes made to one data replica are broadcast or
otherwise transferred to other replicas, either immediately or as required.
This solution is expensive, both in implementation complexity and in the
inevitable reduction in performance.
2. Incoherent memory: An incoherent view of memory is accepted—
programmers must contend with the additional complexity this entails.
This solution is frugal in system implementation, but it is expensive due to
the likely reduction in programmer efficiency.
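The first approach can be sketched as a write-invalidate discipline: a write removes every other cache's replica, forcing later reads to refetch the current value. This is a drastic simplification of real protocols such as MESI, and the write-through policy is chosen only to keep the example short.

```python
# Sketch of write-invalidate coherence: a write broadcasts an invalidation
# to every other cache, so no stale replica can survive the write.
memory = {0x100: 1}
all_caches = []

class CoherentL1:
    def __init__(self):
        self.lines = {}
        all_caches.append(self)
    def read(self, addr):
        if addr not in self.lines:           # miss: fill from main memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]
    def write(self, addr, value):
        for other in all_caches:             # the "expensive" coherence
            if other is not self:            # traffic: invalidate replicas
                other.lines.pop(addr, None)
        self.lines[addr] = value
        memory[addr] = value                 # write-through for simplicity

core0, core1 = CoherentL1(), CoherentL1()
core0.read(0x100)
core1.read(0x100)          # both cores hold the value 1
core0.write(0x100, 42)     # invalidates core 1's replica
fresh = core1.read(0x100)  # miss: refetches 42 instead of a stale 1
```

The broadcast on every write is where the implementation complexity and performance cost mentioned in option 1 come from: the traffic grows with the number of caches that might hold a replica.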
15. Abbreviations L1$, L2$, . . . , Ln$ are sometimes used informally, such as in figures.