cy, the processor will redo the execution. The key to using this approach is that the processor
need only guarantee that the result would be the same as if all accesses were completed in order, and it can achieve this by detecting when the results might differ. The approach is attractive because the speculative restart will rarely be triggered. It will only be triggered when there are unsynchronized accesses that actually cause a race [Gharachorloo, Gupta, and Hennessy 1992].
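The guarantee that the result matches some in-order execution can be illustrated with a classic two-thread litmus test. The sketch below (a hypothetical illustration, not from the text) enumerates every interleaving that respects each thread's program order and shows that the outcome (r1, r2) == (0, 0) can never occur under sequential consistency; a speculative processor that reordered accesses and produced that outcome would have to restart.

```python
# Dekker-style litmus test: shared x and y both start at 0.
# Thread A: x = 1; r1 = y      Thread B: y = 1; r2 = x
# Enumerating all sequentially consistent interleavings shows
# (r1, r2) == (0, 0) is impossible, so producing it would force
# a speculative restart.

A = [("write", "x"), ("read", "y")]   # thread A's program order
B = [("write", "y"), ("read", "x")]   # thread B's program order

def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's order."""
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

outcomes = set()
for order in interleavings(A, B):
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, var in order:
        if op == "write":
            mem[var] = 1
        else:
            regs[var] = mem[var]      # r1 reads y, r2 reads x
    outcomes.add((regs["y"], regs["x"]))  # (r1, r2)

print(sorted(outcomes))   # (0, 0) never appears under SC
```

A processor speculating past these accesses need only check, at commit, that the values it read are still the ones an in-order execution would have produced.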
Hill [1998] advocated the combination of sequential or processor consistency together with
speculative execution as the consistency model of choice. His argument has three parts. First,
an aggressive implementation of either sequential consistency or processor consistency will
gain most of the advantage of a more relaxed model. Second, such an implementation adds
very little to the implementation cost of a speculative processor. Third, such an approach allows the programmer to reason using the simpler programming models of either sequential
or processor consistency. The MIPS R10000 design team had this insight in the mid-1990s and
used the R10000's out-of-order capability to support this type of aggressive implementation of
sequential consistency.
One open question is how successful compiler technology will be in optimizing memory ref-
erences to shared variables. The state of optimization technology and the fact that shared data
are often accessed via pointers or array indexing have limited the use of such optimizations.
If this technology became available and led to significant performance advantages, compiler
writers would want to be able to take advantage of a more relaxed programming model.
Inclusion And Its Implementation
All multiprocessors use multilevel cache hierarchies to reduce both the demand on the global
interconnect and the latency of cache misses. If the cache also provides multilevel inclusion—every level of the cache hierarchy is a subset of the level further away from the processor—then we can use the multilevel structure to reduce the contention between coherence
traffic and processor traffic that occurs when snoops and processor cache accesses must con-
tend for the cache. Many multiprocessors with multilevel caches enforce the inclusion prop-
erty, although recent multiprocessors with smaller L1 caches and different block sizes have
sometimes chosen not to enforce inclusion. This restriction is also called the subset property be-
cause each cache is a subset of the cache below it in the hierarchy.
At first glance, preserving the multilevel inclusion property seems trivial. Consider a two-
level example: Any miss in L1 either hits in L2 or generates a miss in L2, causing it to be
brought into both L1 and L2. Likewise, any invalidate that hits in L2 must be sent to L1, where
it will cause the block to be invalidated if it exists.
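The two rules above—fill both levels on an L1 miss, and forward any invalidate that hits in L2 down to L1—can be sketched in a minimal two-level model. The class and method names are hypothetical, and capacity and replacement are deliberately ignored (they are exactly where the complication discussed next arises).

```python
# Minimal two-level hierarchy with equal block sizes, tracking only which
# block addresses are resident. Inclusion (L1 is a subset of L2) is
# maintained automatically by the two rules from the text.

class TwoLevelCache:
    def __init__(self):
        self.l1 = set()   # block addresses resident in L1
        self.l2 = set()   # block addresses resident in L2

    def access(self, block):
        """Processor reference: on an L1 miss, fill both L1 and L2."""
        if block not in self.l1:
            self.l2.add(block)   # L2 hit, or fill from memory
            self.l1.add(block)

    def snoop_invalidate(self, block):
        """Coherence invalidate: if it hits in L2, forward it to L1."""
        if block in self.l2:
            self.l2.discard(block)
            self.l1.discard(block)   # forwarded invalidate

    def inclusion_holds(self):
        return self.l1 <= self.l2   # every L1 block is also in L2

c = TwoLevelCache()
for b in (0x40, 0x80, 0x40, 0xC0):
    c.access(b)
c.snoop_invalidate(0x80)
print(c.inclusion_holds())   # True: L1 remains a subset of L2
```

With unlimited capacity and equal block sizes, no sequence of accesses and invalidates can break the subset property; replacements and differing block sizes are what make the real problem harder.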
The catch is what happens when the block sizes of L1 and L2 are different. Choosing dif-
ferent block sizes is quite reasonable, since L2 will be much larger and have a much longer
latency component in its miss penalty, and thus will want to use a larger block size. What hap-
pens to our “automatic” enforcement of inclusion when the block sizes differ? A block in L2
represents multiple blocks in L1, and a miss in L2 causes the replacement of data that is equi-
valent to multiple L1 blocks. For example, if the block size of L2 is four times that of L1, then a
miss in L2 will replace the equivalent of four L1 blocks. Let's consider a detailed example.
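The block-size arithmetic can be made concrete with a small sketch, assuming (hypothetically) 16-byte L1 blocks and 64-byte L2 blocks, so one L2 block covers exactly four L1 blocks. When L2 chooses a victim, inclusion requires invalidating every L1 block whose addresses fall inside the victim's range.

```python
# One L2 block covers L2_BLOCK // L1_BLOCK consecutive L1 blocks, so an
# L2 replacement may force that many L1 invalidations to preserve inclusion.

L1_BLOCK = 16   # bytes (assumed for illustration)
L2_BLOCK = 64   # bytes; one L2 block holds four L1 blocks

def l1_blocks_covered(addr):
    """L1 block addresses contained in the L2 block holding addr."""
    base = addr - (addr % L2_BLOCK)          # align down to the L2 block
    return [base + i * L1_BLOCK for i in range(L2_BLOCK // L1_BLOCK)]

# Replacing the L2 block that holds address 0x1234 forces up to four
# L1 invalidations:
print([hex(a) for a in l1_blocks_covered(0x1234)])
# ['0x1200', '0x1210', '0x1220', '0x1230']
```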
Example