Limitations in Symmetric Shared-Memory Multiprocessors and Snooping Protocols
As the number of processors in a multiprocessor grows, or as the memory demands of each
processor grow, any centralized resource in the system can become a bottleneck. Using the
higher-bandwidth connection available on-chip and a shared L3 cache, which is faster than
memory, designers have managed to support four to eight high-performance cores in a
symmetric fashion. Such an approach is unlikely to scale much past eight cores, and it will not
work once multiple multicores are combined.
Snooping bandwidth at the caches can also become a problem, since every cache must
examine every miss placed on the bus. As we mentioned, duplicating the tags is one solution.
Another approach, which has been adopted in some recent multicores, is to place a directory
at the level of the outermost cache. The directory explicitly indicates which processor's caches
have copies of every item in the outermost cache. This is the approach Intel uses on the i7 and
Xeon 7000 series. Note that the use of this directory does not eliminate the bottleneck due to
a shared bus and L3 among the processors, but it is much simpler to implement than the
distributed directory schemes that we will examine in Section 5.4.
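The bookkeeping such a directory performs can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of any vendor's implementation: the class name L3Directory, its methods, and the use of one presence bit per core are all hypothetical choices made for illustration. It shows only the core idea, that a write needs to invalidate just the caches the directory lists as holding a copy.

```python
from dataclasses import dataclass, field


@dataclass
class L3Directory:
    """Toy directory kept alongside a shared outermost (L3) cache."""
    num_cores: int
    # Maps a cache-line address to a bit vector; bit i set means core i
    # may hold a copy of that line in its private caches (assumed layout).
    sharers: dict = field(default_factory=dict)

    def record_read(self, line_addr: int, core: int) -> None:
        """Note that `core` now holds a copy of the line."""
        self.sharers[line_addr] = self.sharers.get(line_addr, 0) | (1 << core)

    def cores_to_invalidate(self, line_addr: int, writer: int) -> list:
        """Return the cores, other than `writer`, listed as holding the line."""
        bits = self.sharers.get(line_addr, 0) & ~(1 << writer)
        return [c for c in range(self.num_cores) if bits & (1 << c)]

    def record_write(self, line_addr: int, writer: int) -> list:
        """Invalidate other sharers and leave `writer` as the only holder."""
        targets = self.cores_to_invalidate(line_addr, writer)
        self.sharers[line_addr] = 1 << writer
        return targets


directory = L3Directory(num_cores=4)
directory.record_read(0x40, core=0)
directory.record_read(0x40, core=2)
invalidated = directory.record_write(0x40, writer=2)
print(invalidated)  # [0]
```

In this sketch only core 0 receives an invalidation, whereas a pure snooping scheme would have every cache check the request. The lookup itself still goes through the shared L3, which is why the bottleneck noted above remains.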
How can a designer increase the memory bandwidth to support either more or faster
processors? To increase the communication bandwidth between processors and memory,
designers have used multiple buses as well as interconnection networks, such as crossbars or
small point-to-point networks. In such designs, the memory system (either main memory or
a shared cache) can be configured into multiple physical banks, so as to boost the effective
memory bandwidth while retaining uniform access time to memory. Figure 5.8 shows how
such a system might look if it were implemented with a single-chip multicore. Although such
an approach might be used to allow more than four cores to be interconnected on a single chip,
it does not scale well to a multichip multiprocessor that uses multicore building blocks, since
the memory is already attached to the individual multicore chips, rather than centralized.
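One common way to spread accesses over multiple physical banks is to interleave them by cache-line index. The following is a minimal sketch under assumed parameters: the 64-byte line size, the four banks, and the function name bank_of are illustrative choices, not values taken from the text or from Figure 5.8.

```python
LINE_BYTES = 64   # assumed cache-line size in bytes
NUM_BANKS = 4     # assumed number of physical banks


def bank_of(addr: int, line_bytes: int = LINE_BYTES, num_banks: int = NUM_BANKS) -> int:
    """Map a byte address to a bank using low-order line-index interleaving."""
    return (addr // line_bytes) % num_banks


# Eight consecutive cache lines land on banks 0, 1, 2, 3, 0, 1, 2, 3.
banks = [bank_of(line * LINE_BYTES) for line in range(8)]
print(banks)
```

Because consecutive lines fall in different banks, several requests can be serviced in parallel, while every bank remains equally distant from every core, so access time stays uniform.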