which eliminate the need for broadcast to all caches on a miss. As processor speeds and the
number of cores per processor increase, more designers are likely to opt for such protocols to
avoid the broadcast limit of a snooping protocol.
Implementing Snooping Cache Coherence
The devil is in the details.
Classic proverb
When we wrote the first edition of this topic in 1990, our final “Putting It All Together”
was a 30-processor, single-bus multiprocessor using snoop-based coherence; the bus had a
capacity of just over 50 MB/sec, which would not be enough bus bandwidth to support even
one core of an Intel i7 in 2011! When we wrote the second edition of this topic in 1995, the
first cache coherence multiprocessors with more than a single bus had recently appeared, and
we added an appendix describing the implementation of snooping in a system with multiple
buses. In 2011, most multicore processors that support only a single-chip multiprocessor have
opted to use a shared bus structure connecting to either a shared memory or a shared cache.
In contrast, every multicore multiprocessor system that supports 16 or more cores uses an
interconnect other than a single bus, and designers must face the challenge of implementing
snooping without the simplification of a bus to serialize events.
As we said earlier, the major complication in actually implementing the snooping coherence
protocol we have described is that write and upgrade misses are not atomic in any recent
multiprocessor. The steps of detecting a write or upgrade miss, communicating with the other
processors and memory, getting the most recent value for a write miss and ensuring that any
invalidates are processed, and updating the cache cannot be done as if they took a single cycle.
In a single multicore chip, these steps can be made effectively atomic by arbitrating for the
bus to the shared cache or memory first (before changing the cache state) and not releasing
the bus until all actions are complete. How can the processor know when all the invalidates
are complete? In some multicores, a single line is used to signal when all necessary invalidates
have been received and are being processed. Following that signal, the processor that
generated the miss can release the bus, knowing that any required actions will be completed before
any activity related to the next miss. By holding the bus exclusively during these steps, the
processor effectively makes the individual steps atomic.
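
The bus-holding sequence described above can be sketched as a small simulation. This is only an illustrative model, not a description of any real multicore: the `Bus` class, the `write_miss` function, the three-letter state encoding ('M', 'S', 'I'), and the use of a software lock to stand in for bus arbitration are all assumptions made for the example.

```python
import threading

class Bus:
    """Hypothetical shared bus: the lock models exclusive bus ownership."""
    def __init__(self, num_caches):
        self.lock = threading.Lock()
        # One state table per cache: block address -> 'M', 'S', or 'I'
        # (a missing entry is treated as 'I').
        self.caches = [dict() for _ in range(num_caches)]
        self.log = []  # order in which write misses were serialized

def write_miss(bus, cache_id, block):
    """Handle a write or upgrade miss while holding the bus throughout."""
    with bus.lock:  # arbitrate for the bus before changing any cache state
        acks = 0
        for other_id, states in enumerate(bus.caches):
            if other_id != cache_id:
                states[block] = 'I'  # the snooping cache invalidates its copy
                acks += 1
        # Stand-in for the single "all invalidates received" signal line.
        assert acks == len(bus.caches) - 1
        bus.caches[cache_id][block] = 'M'  # only now update the requester
        bus.log.append(cache_id)
    # The bus is released here, so the next miss sees a consistent state.

def run_demo(num_caches=4, block=0x40):
    """Let every cache issue a write miss to the same block concurrently."""
    bus = Bus(num_caches)
    threads = [threading.Thread(target=write_miss, args=(bus, i, block))
               for i in range(num_caches)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bus
```

After `run_demo()` returns, exactly one cache holds the block in state 'M', and it is the cache whose miss was serialized last on the bus. Because every step of a miss happens while the lock is held, no cache ever observes a half-finished transaction.
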
In a system without a bus, we must find some other method of making the steps in a miss
atomic. In particular, we must ensure that two processors that attempt to write the same block
at the same time, a situation called a race, are strictly ordered: One write is processed
and completes before the next is begun. It does not matter which of two writes in a race wins
the race, just that there be only a single winner whose coherence actions are completed first. In
a snooping system, ensuring that a race has only one winner is accomplished by using
broadcast for all misses as well as some basic properties of the interconnection network. These
properties, together with the ability to restart the miss handling of the loser in a race, are the keys
to implementing snooping cache coherence without a bus. We explain the details in Appendix
I.
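
The two ingredients named above, a single winner per race and a restart for the loser, can be illustrated with a toy model. This is not the mechanism from Appendix I. It assumes that every node sees broadcast requests in the same order (modeled by one shared queue) and that a miss takes a fixed number of steps to complete. The function `resolve_race` and its `latency` parameter are hypothetical names introduced for the example.

```python
from collections import deque

def resolve_race(requests, latency=2):
    """Order competing write misses so that each block has one winner at a time.

    requests: list of (core_id, block) write misses issued together.
    Returns (completed, restarts): the completion order and the number of
    times a losing request had to restart its miss handling.
    """
    pending = deque(requests)  # assumed: all nodes observe this same order
    busy = {}                  # block -> [winning core, steps remaining]
    completed = []
    restarts = 0
    while pending or busy:
        # Advance every in-flight miss by one step; retire finished ones.
        for block in list(busy):
            busy[block][1] -= 1
            if busy[block][1] == 0:
                completed.append((busy[block][0], block))
                del busy[block]
        if pending:
            core, block = pending.popleft()
            if block in busy:
                # Lost the race: another write to this block is in flight,
                # so this miss restarts and is retried later.
                restarts += 1
                pending.append((core, block))
            else:
                busy[block] = [core, latency]  # single winner for this block
    return completed, restarts
```

For two cores racing on the same block, `resolve_race([(0, 'A'), (1, 'A')])` completes core 0's write first and core 1's second, with one restart. Writes to different blocks do not race and complete without any restart.
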
It is possible to combine snooping and directories, and several designs use snooping within
a multicore and directories among multiple chips or, vice versa, directories within a multicore
and snooping among multiple chips.