triangle-interpolated texture coordinates, can also disrupt or destroy locality. The
resultant complications contributed to the delay of support for dependent texture
lookup, which was introduced to GPUs years after shader programmability was
first supported.
Thus far we have considered a single cache memory, but modern GPUs often
have two levels of cache, and CPUs even more (three, sometimes four). By con-
vention these are named L1 cache, L2 cache, . . . , Ln cache, counting outward from
the processor toward main memory.^15 L1 cache has the least capacity, but also the
least latency, and is optimized to interact well with the processor. Ln cache has
the greatest capacity and the highest latency, and is optimized to interact well
with main memory. Worst-case latency may actually increase as levels of memory
hierarchy are added, due to the summation of multiple miss penalties, but overall
performance is improved.
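This tradeoff can be made concrete with the standard average-memory-access-time (AMAT) model: each level's miss rate gates the expected cost of going one level further out. The latencies and miss rates below are illustrative placeholders, not measurements from any particular GPU.

```python
# Average memory access time (AMAT) for a two-level cache hierarchy.
# All cycle counts and miss rates are hypothetical, chosen only to
# illustrate the shape of the tradeoff.
def amat(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    # A miss at each level pays that level's hit time plus the
    # expected cost of the next level out.
    return l1_hit + l1_miss_rate * (l2_hit + l2_miss_rate * mem_latency)

# Single cache: 4-cycle hit, 10% miss rate, 200-cycle main memory.
single_level = 4 + 0.10 * 200                 # 24.0 cycles on average

# Add an L2: 20-cycle hit, and 30% of L1 misses also miss in L2.
two_level = amat(4, 0.10, 20, 0.30, 200)      # 12.0 cycles on average

# Worst case (miss at every level) is higher with the extra level,
# illustrating the summation of miss penalties mentioned above:
worst_single = 4 + 200                        # 204 cycles
worst_two = 4 + 20 + 200                      # 224 cycles
```

Average latency halves in this example even though the worst-case path lengthens, which is exactly the behavior described in the text.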
In systems with multiple processor cores, caching is typically also parallelized.
The GeForce 9800 GTX GPU, for example, implements a separate L1 cache for
each pair of cores, and a separate L2 cache for each bank of memory (see Fig-
ure 38.4). Multiple L1 caches allow each to be tightly coupled with only two
processor cores, reducing latency by improving locality (each cache is physi-
cally closer to its cores) and by reducing access conflicts (each cache receives
requests from fewer cores). Pairing an L2 cache with each memory bank allows
each cache to aggregate accesses that map to its portion of main memory. Explicit
local memory is also parallelized—the GeForce 9800 GTX implements a separate
local memory for each core.
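The pairing of an L2 cache with each memory bank amounts to a fixed mapping from physical address to bank. The sketch below interleaves cache-line-sized blocks across banks; the bank count and line size are assumptions for illustration, not the GeForce 9800 GTX's actual parameters.

```python
# Sketch: mapping a physical address to the L2 cache paired with its
# memory bank. NUM_BANKS and LINE_BYTES are illustrative assumptions.
NUM_BANKS = 4
LINE_BYTES = 64  # interleave at cache-line granularity

def l2_bank(address):
    # Consecutive cache lines rotate through the banks, so streaming
    # accesses spread evenly across all L2 caches.
    return (address // LINE_BYTES) % NUM_BANKS

# Four consecutive lines land in four different banks:
banks = [l2_bank(a) for a in range(0, 256, 64)]   # [0, 1, 2, 3]
```

Because the mapping is a pure function of the address, every request for a given location is handled by the same L2 cache, which is what lets each cache safely aggregate accesses to its portion of main memory.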
Recall that a key advantage of implicit local memory is the simplicity and
reliability of its programming model. Adding hierarchy does not compromise this
model: Although a single physical memory location may now be cached at mul-
tiple levels of the memory hierarchy, memory requests from multiple cores “see”
a consistent value because the requests are handled consistently. Parallel caches
(such as multiple L1 caches) potentially break the model, however, because they
are accessed and updated independently, so replications of a single physical mem-
ory location can become inconsistent. If this happens, the programmer's model of
the memory system has become incoherent, and the likelihood of programming
errors skyrockets.
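A minimal simulation makes the hazard concrete. The cache model below is a deliberate toy (write-back, no coherence traffic between caches); the address and values are arbitrary.

```python
# Sketch: two independent L1 caches with no coherence protocol can hold
# inconsistent replicas of the same memory location.
memory = {0x100: 1}

class L1Cache:
    def __init__(self):
        self.lines = {}
    def read(self, addr):
        if addr not in self.lines:       # miss: fill from main memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]          # hit: return the cached replica
    def write(self, addr, value):
        self.lines[addr] = value         # write-back: memory not yet updated

core0, core1 = L1Cache(), L1Cache()
core0.read(0x100)          # both cores cache the location (value 1)
core1.read(0x100)
core0.write(0x100, 42)     # core 0 updates only its own replica
stale = core1.read(0x100)  # core 1 still sees 1: the replicas have diverged
```

After the write, core 0 observes 42 while core 1 observes 1, and main memory still holds 1: three inconsistent views of one location, which is precisely the incoherence the text warns about.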
Architects of parallel systems handle the cache coherence problem in one of
three ways.
1. Coherent memory: A coherent view of memory is enforced by adding
complexity to the memory hierarchy implementation. Cache-coherent pro-
tocols ensure that changes made to one data replica are broadcast or
otherwise transferred to other replicas, either immediately or as required.
This solution is expensive, both in implementation complexity and in the
inevitable reduction in performance.
2. Incoherent memory: An incoherent view of memory is accepted—
programmers must contend with the additional complexity this entails.
This solution is frugal in system implementation, but it is expensive due to
the likely reduction in programmer efficiency.
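The first approach can be sketched as a write-invalidate discipline: a write removes every other cache's replica, forcing later reads to refetch the current value. This is a drastic simplification of real protocols such as MESI, and the write-through policy is chosen only to keep the example short.

```python
# Sketch of write-invalidate coherence: a write broadcasts an invalidation
# to every other cache, so no stale replica can survive the write.
memory = {0x100: 1}
all_caches = []

class CoherentL1:
    def __init__(self):
        self.lines = {}
        all_caches.append(self)
    def read(self, addr):
        if addr not in self.lines:           # miss: fill from main memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]
    def write(self, addr, value):
        for other in all_caches:             # the "expensive" coherence
            if other is not self:            # traffic: invalidate replicas
                other.lines.pop(addr, None)
        self.lines[addr] = value
        memory[addr] = value                 # write-through for simplicity

core0, core1 = CoherentL1(), CoherentL1()
core0.read(0x100)
core1.read(0x100)          # both cores hold the value 1
core0.write(0x100, 42)     # invalidates core 1's replica
fresh = core1.read(0x100)  # miss: refetches 42 instead of a stale 1
```

The broadcast on every write is where the implementation complexity and performance cost mentioned in option 1 come from: the traffic grows with the number of caches that might hold a replica.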
15. Abbreviations L1$, L2$, . . . , Ln$ are sometimes used informally, such as in figures.