snooping-based scheme, since it allows the scheme to be inexpensive, as well as its Achilles' heel when
it comes to scalability.
For example, consider a multiprocessor composed of four 4-core multicores capable of sustaining one data reference per clock and a 4 GHz clock. From the data in Section I.5 of Appendix I, we can see that the applications may require 4 GB/sec to 170 GB/sec of bus bandwidth. Although the caches in those experiments are small, most of the traffic is coherence traffic, which is unaffected by cache size. Although a modern bus might accommodate 4 GB/sec, 170 GB/sec is far beyond the capability of any bus-based system. In the last few years, the development of multicore processors forced all designers to shift to some form of distributed memory to support the bandwidth demands of the individual processors.
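The arithmetic behind such a bandwidth estimate can be sketched as follows. The per-reference traffic rates and block size below are illustrative assumptions chosen to span the 4 GB/sec to 170 GB/sec range, not the measured data from Appendix I:

```python
# Back-of-the-envelope bus-bandwidth demand for the example system.
# Traffic rates and block size are assumptions for illustration only.

CORES = 16          # four 4-core multicores
CLOCK_HZ = 4e9      # 4 GHz, one data reference per clock per core
BLOCK_BYTES = 64    # assumed cache-block size

refs_per_sec = CORES * CLOCK_HZ   # 6.4e10 data references/sec in total

def bus_demand_gb_per_sec(traffic_rate):
    """GB/sec of bus bandwidth needed when traffic_rate is the fraction
    of references that generate bus traffic (misses plus coherence)."""
    return refs_per_sec * traffic_rate * BLOCK_BYTES / 1e9

for rate in (0.001, 0.01, 0.04):   # assumed traffic rates
    print(f"traffic rate {rate:.3f} -> "
          f"{bus_demand_gb_per_sec(rate):.0f} GB/sec")
```

Even a modest traffic rate of one bus transaction per thousand references already demands about 4 GB/sec; a rate a few tens of times higher pushes the demand past what any bus-based system can deliver.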
We can increase the memory bandwidth and interconnection bandwidth by distributing the
memory, as shown in Figure 5.2 on page 348; this immediately separates local memory traffic
from remote memory traffic, reducing the bandwidth demands on the memory system and
on the interconnection network. Unless we eliminate the need for the coherence protocol to
broadcast on every cache miss, however, distributing the memory will gain us little.
As we mentioned earlier, the alternative to a snooping-based coherence protocol is a directory protocol. A directory keeps the state of every block that may be cached. Information in the
directory includes which caches (or collections of caches) have copies of the block, whether it
is dirty, and so on. Within a multicore with a shared outermost cache (say, L3), it is easy to implement a directory scheme: simply keep a bit vector, of size equal to the number of cores, for each L3 block. The bit vector indicates which private caches may have copies of a block in L3, and invalidations are sent only to those caches. This works perfectly for a single multicore if L3 is inclusive, and this scheme is the one used in the Intel i7.
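A minimal sketch of such a per-block sharer bit vector, with assumed names and a fixed core count, might look like this: a bit is set when a core's private cache fills the block from L3, and a write consults the vector so that invalidations go only to the caches that may hold a copy.

```python
# Sketch (assumed names) of a sharer bit vector kept per L3 block in an
# inclusive shared L3: one bit per core.

NUM_CORES = 4

class L3Block:
    def __init__(self):
        self.sharers = 0    # bit i set => core i's private cache may hold a copy
        self.dirty = False

    def record_fill(self, core):
        """A core's private cache fetched this block from L3."""
        self.sharers |= (1 << core)

    def handle_write(self, writer):
        """Return the cores whose private caches must be invalidated
        before the writer gains exclusive access."""
        targets = [c for c in range(NUM_CORES)
                   if (self.sharers >> c) & 1 and c != writer]
        self.sharers = 1 << writer   # writer is now the only sharer
        self.dirty = True
        return targets

blk = L3Block()
blk.record_fill(0)
blk.record_fill(2)
print(blk.handle_write(0))   # only core 2 needs an invalidation
```

Because L3 is inclusive, a clear bit is a guarantee: if a core's bit is zero, its private caches cannot hold the block, so no message need be sent to it.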
The solution of a single directory used in a multicore is not scalable, even though it avoids
broadcast. The directory must be distributed, but the distribution must be done in such a way that the coherence protocol knows where to find the directory information for any cached block of
memory. The obvious solution is to distribute the directory along with the memory, so that
different coherence requests can go to different directories, just as different memory requests
go to different memories. A distributed directory retains the characteristic that the sharing
status of a block is always in a single known location. This property, together with the maintenance of information that says which other nodes may be caching the block, is what allows
the coherence protocol to avoid broadcast. Figure 5.20 shows how our distributed-memory
multiprocessor looks with the directories added to each node.
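The "single known location" property follows from making the home node a fixed function of the block's physical address. The interleaving scheme, node count, and block size below are assumptions for illustration; real machines use various address-to-node mappings:

```python
# Sketch of locating the directory entry for a block in a
# distributed-directory machine. Parameters are assumed.

NUM_NODES = 64
BLOCK_BYTES = 64
MEM_PER_NODE = 1 << 30   # 1 GiB of memory per node, coarse interleaving

def home_node(addr):
    """Node whose memory, and hence whose directory, owns this address."""
    return (addr // MEM_PER_NODE) % NUM_NODES

def directory_index(addr):
    """Index of the block's directory entry within its home node."""
    return (addr % MEM_PER_NODE) // BLOCK_BYTES

addr = 0x1_8000_0040
print(home_node(addr), directory_index(addr))
```

Any node that misses on this address computes the same home node from the address bits alone, so a coherence request goes directly to the one directory that holds the block's sharing state, with no broadcast.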