In that appendix, we examine the nature of such applications and the challenges of achieving speedup with dozens to hundreds of processors.
5.2 Centralized Shared-Memory Architectures
The observation that the use of large, multilevel caches can substantially reduce the memory bandwidth demands of a processor is the key insight that motivates centralized memory multiprocessors. Originally, these processors were all single-core and often took an entire board, and memory was located on a shared bus. With more recent, higher-performance processors, the memory demands have outstripped the capability of reasonable buses, and recent microprocessors directly connect memory to a single chip, which is sometimes called a backside or memory bus to distinguish it from the bus used to connect to I/O. Accessing a chip's local memory, whether for an I/O operation or for an access from another chip, requires going through the chip that “owns” that memory. Thus, access to memory is asymmetric: faster to the local memory and slower to the remote memory. In a multicore, that memory is shared among all the cores on a single chip, but the asymmetric access to the memory of one multicore from another remains.
Symmetric shared-memory machines usually support the caching of both shared and private data. Private data are used by a single processor, while shared data are used by multiple processors, essentially providing communication among the processors through reads and writes of the shared data. When a private item is cached, its location is migrated to the cache, reducing the average access time as well as the memory bandwidth required. Since no other processor uses the data, the program behavior is identical to that in a uniprocessor. When shared data are cached, the shared value may be replicated in multiple caches. In addition to the reduction in access latency and required memory bandwidth, this replication also provides a reduction in contention that may exist for shared data items that are being read by multiple processors simultaneously. Caching of shared data, however, introduces a new problem: cache coherence.
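To make the distinction concrete, the following C sketch (our illustration, not from the text) shows both kinds of data in a two-thread program: each thread's loop index and partial sum are private, while the globals result and ready are shared and carry communication between the threads through ordinary reads and writes.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int result;               /* shared: written by worker, read by main */
static atomic_int ready;         /* shared: signals that result is valid    */

static void *worker(void *arg)
{
    int i, sum = 0;              /* private: live only in this thread       */
    for (i = 1; i <= 100; i++)
        sum += i;
    result = sum;                /* communication happens through writes    */
    atomic_store(&ready, 1);     /* to shared locations...                  */
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    while (!atomic_load(&ready)) /* ...and reads of those same locations    */
        ;
    printf("result = %d\n", result);   /* prints 5050 */
    pthread_join(&t, NULL);
    return 0;
}

Because each core may hold cached copies of result and ready, it is the coherence mechanism discussed next that guarantees the spinning reader eventually observes the worker's writes.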
What Is Multiprocessor Cache Coherence?
Unfortunately, caching shared data introduces a new problem because the view of memory held by two different processors is through their individual caches, which, without any additional precautions, could end up seeing two different values. Figure 5.3 illustrates the problem and shows how two different processors can have two different values for the same location. This difficulty is generally referred to as the cache coherence problem. Notice that the coherence problem exists because we have both a global state, defined primarily by the main memory, and a local state, defined by the individual caches, which are private to each processor core. Thus, in a multicore where some level of caching may be shared (for example, an L3), while some levels are private (for example, L1 and L2), the coherence problem still exists and must be solved.
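Because coherent hardware hides the problem, the easiest way to watch it happen is in a toy simulation. The following sketch (an illustration we add here; cache_t, read_word, and write_word are invented names) models a single location X in main memory plus two private write-back caches with no coherence protocol, reproducing the divergence shown in Figure 5.3.

#include <stdio.h>

static int memory_X = 0;                  /* location X in main memory    */

typedef struct { int valid, dirty, data; } cache_t;   /* one-block cache  */

static int read_word(cache_t *c)          /* load X through a cache       */
{
    if (!c->valid) {                      /* miss: fill from main memory  */
        c->data = memory_X;
        c->valid = 1;
    }
    return c->data;                       /* hit: return the cached copy  */
}

static void write_word(cache_t *c, int v)
{
    c->data = v;                          /* write-back: update the cache */
    c->dirty = 1;                         /* only; memory stays stale     */
    c->valid = 1;                         /* until the block is evicted   */
}

int main(void)
{
    cache_t a = {0}, b = {0};             /* private caches of cores A, B */

    printf("A reads X: %d\n", read_word(&a));   /* A caches X = 0         */
    printf("B reads X: %d\n", read_word(&b));   /* B caches X = 0         */
    write_word(&a, 1);                          /* A writes X = 1         */

    /* No protocol ever invalidated or updated B's copy, so the two
     * cores now see different values for the same location:              */
    printf("A reads X: %d\n", read_word(&a));   /* prints 1               */
    printf("B reads X: %d\n", read_word(&b));   /* prints 0 (stale)       */
    return 0;
}

A coherence protocol removes this divergence by making A's write either invalidate or update B's cached copy before B can read the location again.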
 