Thread-Level Parallelism - Computer Architecture: A Quantitative Approach


which eliminate the need for broadcast to all caches on a miss. As processor speeds and the number of cores per processor increase, more designers are likely to opt for such protocols to avoid the broadcast limit of a snooping protocol.

Implementing Snooping Cache Coherence

The devil is in the details.

Classic proverb

When we wrote the first edition of this book in 1990, our final “Putting It All Together” was a 30-processor, single-bus multiprocessor using snoop-based coherence; the bus had a capacity of just over 50 MB/sec, which would not be enough bus bandwidth to support even one core of an Intel i7 in 2011! When we wrote the second edition of this book in 1995, the first cache coherence multiprocessors with more than a single bus had recently appeared, and we added an appendix describing the implementation of snooping in a system with multiple buses. In 2011, most multicore processors that support only a single-chip multiprocessor have opted to use a shared bus structure connecting to either a shared memory or a shared cache. In contrast, every multicore multiprocessor system that supports 16 or more cores uses an interconnect other than a single bus, and designers must face the challenge of implementing snooping without the simplification of a bus to serialize events.

As we said earlier, the major complication in actually implementing the snooping coherence protocol we have described is that write and upgrade misses are not atomic in any recent multiprocessor. The steps of detecting a write or upgrade miss, communicating with the other processors and memory, getting the most recent value for a write miss and ensuring that any invalidates are processed, and updating the cache cannot be done as if they took a single cycle.

In a single multicore chip, these steps can be made effectively atomic by arbitrating for the bus to the shared cache or memory first (before changing the cache state) and not releasing the bus until all actions are complete. How can the processor know when all the invalidates are complete? In some multicores, a single line is used to signal when all necessary invalidates have been received and are being processed. Following that signal, the processor that generated the miss can release the bus, knowing that any required actions will be completed before any activity related to the next miss. By holding the bus exclusively during these steps, the processor effectively makes the individual steps atomic.
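The bus-holding discipline described above can be illustrated with a minimal Python sketch. It is a software analogy, not a hardware description: the `Bus` and `Cache` classes, the two-state "M"/"I" encoding, and the `count_owners` helper are all hypothetical names introduced here, and a `threading.Lock` stands in for bus arbitration.

```python
import threading

class Bus:
    """Hypothetical shared bus; a lock stands in for bus arbitration."""

    def __init__(self):
        self._grant = threading.Lock()
        self.caches = []

    def write_miss(self, requester, block):
        # Arbitrate for the bus first, before changing any cache state.
        with self._grant:
            # Broadcast the invalidate; every other cache snoops it.
            acks = [c.snoop_invalidate(block)
                    for c in self.caches if c is not requester]
            # Stand-in for the single line signalling "all invalidates received".
            if not all(acks):
                raise RuntimeError("invalidate not acknowledged")
            # Only now does the requester update its own cache state.
            requester.state[block] = "M"
        # Leaving the with-block releases the bus after every step is done.


class Cache:
    def __init__(self, bus):
        self.state = {}          # block -> "M" (modified) or "I" (invalid)
        self.bus = bus
        bus.caches.append(self)

    def snoop_invalidate(self, block):
        self.state[block] = "I"
        return True              # acknowledge the invalidate

    def write(self, block):
        # A write to a block not held in M is treated as a write miss.
        if self.state.get(block, "I") != "M":
            self.bus.write_miss(self, block)


def count_owners(num_caches, block="X"):
    """Let every cache write the same block concurrently; return how many end in M."""
    bus = Bus()
    caches = [Cache(bus) for _ in range(num_caches)]
    threads = [threading.Thread(target=c.write, args=(block,)) for c in caches]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(1 for c in caches if c.state.get(block) == "M")
```

Because each write miss holds the lock from the invalidate broadcast through its own state update, `count_owners(8)` should find exactly one cache left in the M state, whichever thread happened to be granted the bus last.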

In a system without a bus, we must find some other method of making the steps in a miss atomic. In particular, we must ensure that two processors that attempt to write the same block at the same time, a situation which is called a race, are strictly ordered: one write is processed and completes before the next is begun. It does not matter which of two writes in a race wins the race, just that there be only a single winner whose coherence actions are completed first. In a snooping system, ensuring that a race has only one winner is accomplished by using broadcast for all misses as well as some basic properties of the interconnection network. These properties, together with the ability to restart the miss handling of the loser in a race, are the keys to implementing snooping cache coherence without a bus. We explain the details in Appendix I.
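The single-winner rule and the loser's restart can be sketched abstractly in Python. This is only an illustration of the ordering requirement, not the mechanism of Appendix I: the function `resolve_race` is a hypothetical name, and the sketch assumes that the network delivers the competing broadcasts to every node in one agreed order, here simply the order of the input list.

```python
def resolve_race(requesters):
    """Serialize simultaneous writes to one block.

    `requesters` lists processor ids in the order the (assumed) network
    delivers their broadcast write misses. Returns the order in which the
    writes complete and how many times each processor restarted its miss.
    """
    pending = list(requesters)
    restarts = {p: 0 for p in pending}
    completed = []
    while pending:
        # The first request in delivery order is the single winner.
        winner, *losers = pending
        # The winner's coherence actions finish before any other write begins.
        completed.append(winner)
        # Every loser abandons its in-flight miss and restarts it.
        for p in losers:
            restarts[p] += 1
        pending = losers
    return completed, restarts
```

For example, `resolve_race(["P0", "P1", "P2"])` completes the writes in the order P0, P1, P2, with P1 restarting once and P2 twice. Which processor wins depends only on the assumed delivery order; the point is that each round has exactly one winner.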

It is possible to combine snooping and directories, and several designs use snooping within a multicore and directories among multiple chips or, vice versa, directories within a multicore and snooping among multiple chips.
