Thread-Level Parallelism - Computer Architecture: A Quantitative Approach

Limitations in Symmetric Shared-Memory Multiprocessors and Snooping Protocols

As the number of processors in a multiprocessor grows, or as the memory demands of each
processor grow, any centralized resource in the system can become a bottleneck. Using the
higher-bandwidth connection available on-chip and a shared L3 cache, which is faster than
memory, designers have managed to support four to eight high-performance cores in a
symmetric fashion. Such an approach is unlikely to scale much past eight cores, and it will
not work once multiple multicores are combined.
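
The scaling limit described above can be sketched with a first-order bandwidth model. This is an illustration only: the function names, the arbitrary bandwidth units, and the assumption that every core places the same steady demand on the shared resource are hypothetical and do not come from the text. The model also ignores queueing delay and cache effects.

```python
def max_cores_supported(shared_bandwidth, per_core_demand):
    """Largest core count whose aggregate demand fits within a shared resource.

    Both arguments use the same (arbitrary) bandwidth unit. Assumes every core
    generates the same steady demand, which real workloads do not.
    """
    if per_core_demand <= 0:
        raise ValueError("per_core_demand must be positive")
    return int(shared_bandwidth // per_core_demand)


def utilization(num_cores, per_core_demand, shared_bandwidth):
    """Fraction of the shared resource consumed; above 1.0 means oversubscribed."""
    return num_cores * per_core_demand / shared_bandwidth
```

With made-up figures, a shared resource of 64 units and cores that each demand 8 units gives `max_cores_supported(64, 8) == 8`. Doubling the per-core demand to 16 halves the supportable core count to 4, which mirrors the point that either more processors or more demanding processors can saturate a centralized resource.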

Snooping bandwidth at the caches can also become a problem, since every cache must
examine every miss placed on the bus. As we mentioned, duplicating the tags is one solution.
Another approach, which has been adopted in some recent multicores, is to place a directory
at the level of the outermost cache. The directory explicitly indicates which processors'
caches have copies of every item in the outermost cache. This is the approach Intel uses on
the i7 and Xeon 7000 series. Note that the use of this directory does not eliminate the
bottleneck due to a shared bus and L3 among the processors, but it is much simpler to
implement than the distributed directory schemes that we will examine in Section 5.4.
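
A directory at the outermost cache can be pictured as a per-line record of which cores hold a private copy. The sketch below is a toy model, not Intel's implementation: the class name, the bitmask encoding, and the method names are all assumptions made for illustration. It contrasts the set of caches that must examine a miss under pure snooping with the set the directory identifies.

```python
class OutermostCacheDirectory:
    """Toy directory: one presence bit per core for each line in the shared cache.

    Hypothetical structure for illustration; real designs differ in encoding
    and in how they handle replacement and coherence states.
    """

    def __init__(self, num_cores):
        self.num_cores = num_cores
        self.presence = {}  # line address -> bitmask of cores with a private copy

    def record_fill(self, line_addr, core):
        """Note that `core` now holds a private copy of `line_addr`."""
        self.presence[line_addr] = self.presence.get(line_addr, 0) | (1 << core)

    def record_eviction(self, line_addr, core):
        """Clear the presence bit for `core`; drop the entry if no sharers remain."""
        mask = self.presence.get(line_addr, 0) & ~(1 << core)
        if mask:
            self.presence[line_addr] = mask
        else:
            self.presence.pop(line_addr, None)

    def sharers(self, line_addr):
        """Cores whose private caches hold the line, in ascending order."""
        mask = self.presence.get(line_addr, 0)
        return [c for c in range(self.num_cores) if mask & (1 << c)]

    def caches_to_probe(self, line_addr, requester):
        """Private caches that must be checked when `requester` misses."""
        return [c for c in self.sharers(line_addr) if c != requester]


def snoop_targets(num_cores, requester):
    """Pure snooping: every other cache examines the miss."""
    return [c for c in range(num_cores) if c != requester]
```

For a four-core chip where only cores 1 and 3 hold a line, a miss by core 3 requires probing just core 1 under this model, whereas snooping would involve cores 0, 1, and 2. The shared L3 and its interconnect are still a single point that all requests pass through, which is why the directory reduces snoop traffic without removing the bottleneck.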

How can a designer increase the memory bandwidth to support either more or faster
processors? To increase the communication bandwidth between processors and memory,
designers have used multiple buses as well as interconnection networks, such as crossbars or
small point-to-point networks. In such designs, the memory system (either main memory or
a shared cache) can be configured into multiple physical banks, so as to boost the effective
memory bandwidth while retaining uniform access time to memory. Figure 5.8 shows how
such a system might look if it were implemented with a single-chip multicore. Although such
an approach might be used to allow more than four cores to be interconnected on a single
chip, it does not scale well to a multichip multiprocessor that uses multicore building
blocks, since the memory is already attached to the individual multicore chips, rather than
centralized.
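
The bandwidth gain from banking can be illustrated with a small model. The text does not say how addresses map to banks, so the sketch assumes line-granularity interleaving on the low-order line-address bits, a 64-byte line, and banks that each serve one request at a time. All names and parameters here are hypothetical.

```python
def bank_of(addr, num_banks, line_bytes=64):
    """Bank chosen by the low-order bits of the line address (assumed mapping)."""
    return (addr // line_bytes) % num_banks


def cycles_to_serve(addrs, num_banks, bank_busy_cycles, line_bytes=64):
    """Cycles until all requests finish, assuming they all arrive at cycle 0,
    each bank handles one request at a time, and the interconnect is not a limit.
    """
    per_bank = [0] * num_banks
    for a in addrs:
        per_bank[bank_of(a, num_banks, line_bytes)] += 1
    # The most heavily loaded bank determines the finishing time.
    return max(per_bank) * bank_busy_cycles
```

Under these assumptions, eight requests to consecutive lines take 80 cycles with one bank that is busy for 10 cycles per request, but 20 cycles with four banks, since each bank receives two requests. The access time of any single request is unchanged, which matches the point that banking raises effective bandwidth while keeping access time uniform. Requests that all fall in the same bank gain nothing.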
