snooping-based scheme, since it allows the scheme to be inexpensive, as well as its Achilles' heel when
it comes to scalability.
For example, consider a multiprocessor composed of four 4-core multicores capable of sustaining one data reference per clock and a 4 GHz clock. From the data in Section I.5 of Appendix I, we can see that the applications may require 4 GB/sec to 170 GB/sec of bus bandwidth. Although the caches in those experiments are small, most of the traffic is coherence traffic, which is unaffected by cache size. Although a modern bus might accommodate 4 GB/sec, 170 GB/sec is far beyond the capability of any bus-based system. In the last few years, the development of multicore processors forced all designers to shift to some form of distributed memory to support the bandwidth demands of the individual processors.
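The arithmetic behind such a bandwidth estimate can be sketched as follows. The per-reference traffic rates and block size below are illustrative assumptions chosen to span the 4 GB/sec to 170 GB/sec range, not the measured data from Appendix I:

```python
# Back-of-the-envelope bus-bandwidth demand for the example system.
# Traffic rates and block size are assumptions for illustration only.

CORES = 16          # four 4-core multicores
CLOCK_HZ = 4e9      # 4 GHz, one data reference per clock per core
BLOCK_BYTES = 64    # assumed cache-block size

refs_per_sec = CORES * CLOCK_HZ   # 6.4e10 data references/sec in total

def bus_demand_gb_per_sec(traffic_rate):
    """GB/sec of bus bandwidth needed when traffic_rate is the fraction
    of references that generate bus traffic (misses plus coherence)."""
    return refs_per_sec * traffic_rate * BLOCK_BYTES / 1e9

for rate in (0.001, 0.01, 0.04):   # assumed traffic rates
    print(f"traffic rate {rate:.3f} -> "
          f"{bus_demand_gb_per_sec(rate):.0f} GB/sec")
```

Even a modest traffic rate of one bus transaction per thousand references already demands about 4 GB/sec; a rate a few tens of times higher pushes the demand past what any bus-based system can deliver.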
We can increase the memory bandwidth and interconnection bandwidth by distributing the
memory, as shown in Figure 5.2 on page 348; this immediately separates local memory traffic
from remote memory traffic, reducing the bandwidth demands on the memory system and
on the interconnection network. Unless we eliminate the need for the coherence protocol to
broadcast on every cache miss, however, distributing the memory will gain us little.
As we mentioned earlier, the alternative to a snooping-based coherence protocol is a directory protocol. A directory keeps the state of every block that may be cached. Information in the
directory includes which caches (or collections of caches) have copies of the block, whether it
is dirty, and so on. Within a multicore with a shared outermost cache (say, L3), it is easy to implement a directory scheme: simply keep a bit vector, of size equal to the number of cores, for each L3 block. The bit vector indicates which private caches may have copies of a block in L3, and invalidations are sent only to those caches. This works perfectly for a single multicore if L3 is inclusive, and this scheme is the one used in the Intel i7.
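A minimal sketch of such a per-block sharer bit vector, with assumed names and a fixed core count, might look like this: a bit is set when a core's private cache fills the block from L3, and a write consults the vector so that invalidations go only to the caches that may hold a copy.

```python
# Sketch (assumed names) of a sharer bit vector kept per L3 block in an
# inclusive shared L3: one bit per core.

NUM_CORES = 4

class L3Block:
    def __init__(self):
        self.sharers = 0    # bit i set => core i's private cache may hold a copy
        self.dirty = False

    def record_fill(self, core):
        """A core's private cache fetched this block from L3."""
        self.sharers |= (1 << core)

    def handle_write(self, writer):
        """Return the cores whose private caches must be invalidated
        before the writer gains exclusive access."""
        targets = [c for c in range(NUM_CORES)
                   if (self.sharers >> c) & 1 and c != writer]
        self.sharers = 1 << writer   # writer is now the only sharer
        self.dirty = True
        return targets

blk = L3Block()
blk.record_fill(0)
blk.record_fill(2)
print(blk.handle_write(0))   # only core 2 needs an invalidation
```

Because L3 is inclusive, a clear bit is a guarantee: if a core's bit is zero, its private caches cannot hold the block, so no message need be sent to it.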
The solution of a single directory used in a multicore is not scalable, even though it avoids
broadcast. The directory must be distributed, but the distribution must be done in such a way that the coherence protocol knows where to find the directory information for any cached block of
memory. The obvious solution is to distribute the directory along with the memory, so that
different coherence requests can go to different directories, just as different memory requests
go to different memories. A distributed directory retains the characteristic that the sharing
status of a block is always in a single known location. This property, together with the maintenance of information that says which other nodes may be caching the block, is what allows
the coherence protocol to avoid broadcast. Figure 5.20 shows how our distributed-memory
multiprocessor looks with the directories added to each node.
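The "single known location" property follows from making the home node a fixed function of the block's physical address. The interleaving scheme, node count, and block size below are assumptions for illustration; real machines use various address-to-node mappings:

```python
# Sketch of locating the directory entry for a block in a
# distributed-directory machine. Parameters are assumed.

NUM_NODES = 64
BLOCK_BYTES = 64
MEM_PER_NODE = 1 << 30   # 1 GiB of memory per node, coarse interleaving

def home_node(addr):
    """Node whose memory, and hence whose directory, owns this address."""
    return (addr // MEM_PER_NODE) % NUM_NODES

def directory_index(addr):
    """Index of the block's directory entry within its home node."""
    return (addr % MEM_PER_NODE) // BLOCK_BYTES

addr = 0x1_8000_0040
print(home_node(addr), directory_index(addr))
```

Any node that misses on this address computes the same home node from the address bits alone, so a coherence request goes directly to the one directory that holds the block's sharing state, with no broadcast.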