cy, the processor will redo the execution. The key to using this approach is that the processor
need only guarantee that the result would be the same as if all accesses were completed in order, and it can achieve this by detecting when the results might differ. The approach is attractive because the speculative restart will rarely be triggered. It will only be triggered when there are unsynchronized accesses that actually cause a race [Gharachorloo, Gupta, and Hennessy 1992].
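The guarantee that the result matches some in-order execution can be illustrated with a classic two-thread litmus test. The sketch below (a hypothetical illustration, not from the text) enumerates every interleaving that respects each thread's program order and shows that the outcome (r1, r2) == (0, 0) can never occur under sequential consistency; a speculative processor that reordered accesses and produced that outcome would have to restart.

```python
# Dekker-style litmus test: shared x and y both start at 0.
# Thread A: x = 1; r1 = y      Thread B: y = 1; r2 = x
# Enumerating all sequentially consistent interleavings shows
# (r1, r2) == (0, 0) is impossible, so producing it would force
# a speculative restart.

A = [("write", "x"), ("read", "y")]   # thread A's program order
B = [("write", "y"), ("read", "x")]   # thread B's program order

def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's order."""
    if not a:
        yield list(b); return
    if not b:
        yield list(a); return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest

outcomes = set()
for order in interleavings(A, B):
    mem = {"x": 0, "y": 0}
    regs = {}
    for op, var in order:
        if op == "write":
            mem[var] = 1
        else:
            regs[var] = mem[var]      # r1 reads y, r2 reads x
    outcomes.add((regs["y"], regs["x"]))  # (r1, r2)

print(sorted(outcomes))   # (0, 0) never appears under SC
```

A processor speculating past these accesses need only check, at commit, that the values it read are still the ones an in-order execution would have produced.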
Hill [1998] advocated the combination of sequential or processor consistency together with
speculative execution as the consistency model of choice. His argument has three parts. First,
an aggressive implementation of either sequential consistency or processor consistency will
gain most of the advantage of a more relaxed model. Second, such an implementation adds
very little to the implementation cost of a speculative processor. Third, such an approach allows the programmer to reason using the simpler programming models of either sequential
or processor consistency. The MIPS R10000 design team had this insight in the mid-1990s and
used the R10000's out-of-order capability to support this type of aggressive implementation of
sequential consistency.
One open question is how successful compiler technology will be in optimizing memory ref-
erences to shared variables. The state of optimization technology and the fact that shared data
are often accessed via pointers or array indexing have limited the use of such optimizations.
If this technology became available and led to significant performance advantages, compiler
writers would want to be able to take advantage of a more relaxed programming model.
Inclusion And Its Implementation
All multiprocessors use multilevel cache hierarchies to reduce both the demand on the global
interconnect and the latency of cache misses. If the cache also provides multilevel inclusion—every level of the cache hierarchy is a subset of the level further away from the processor—then we can use the multilevel structure to reduce the contention between coherence
traffic and processor traffic that occurs when snoops and processor cache accesses must con-
tend for the cache. Many multiprocessors with multilevel caches enforce the inclusion prop-
erty, although recent multiprocessors with smaller L1 caches and different block sizes have
sometimes chosen not to enforce inclusion. This restriction is also called the subset property be-
cause each cache is a subset of the cache below it in the hierarchy.
At first glance, preserving the multilevel inclusion property seems trivial. Consider a two-
level example: Any miss in L1 either hits in L2 or generates a miss in L2, causing it to be
brought into both L1 and L2. Likewise, any invalidate that hits in L2 must be sent to L1, where
it will cause the block to be invalidated if it exists.
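The two rules above—fill both levels on an L1 miss, and forward any invalidate that hits in L2 down to L1—can be sketched in a minimal two-level model. The class and method names are hypothetical, and capacity and replacement are deliberately ignored (they are exactly where the complication discussed next arises).

```python
# Minimal two-level hierarchy with equal block sizes, tracking only which
# block addresses are resident. Inclusion (L1 is a subset of L2) is
# maintained automatically by the two rules from the text.

class TwoLevelCache:
    def __init__(self):
        self.l1 = set()   # block addresses resident in L1
        self.l2 = set()   # block addresses resident in L2

    def access(self, block):
        """Processor reference: on an L1 miss, fill both L1 and L2."""
        if block not in self.l1:
            self.l2.add(block)   # L2 hit, or fill from memory
            self.l1.add(block)

    def snoop_invalidate(self, block):
        """Coherence invalidate: if it hits in L2, forward it to L1."""
        if block in self.l2:
            self.l2.discard(block)
            self.l1.discard(block)   # forwarded invalidate

    def inclusion_holds(self):
        return self.l1 <= self.l2   # every L1 block is also in L2

c = TwoLevelCache()
for b in (0x40, 0x80, 0x40, 0xC0):
    c.access(b)
c.snoop_invalidate(0x80)
print(c.inclusion_holds())   # True: L1 remains a subset of L2
```

With unlimited capacity and equal block sizes, no sequence of accesses and invalidates can break the subset property; replacements and differing block sizes are what make the real problem harder.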
The catch is what happens when the block sizes of L1 and L2 are different. Choosing dif-
ferent block sizes is quite reasonable, since L2 will be much larger and have a much longer
latency component in its miss penalty, and thus will want to use a larger block size. What hap-
pens to our “automatic” enforcement of inclusion when the block sizes differ? A block in L2
represents multiple blocks in L1, and a miss in L2 causes the replacement of data that is equi-
valent to multiple L1 blocks. For example, if the block size of L2 is four times that of L1, then a
miss in L2 will replace the equivalent of four L1 blocks. Let's consider a detailed example.
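The block-size arithmetic can be made concrete with a small sketch, assuming (hypothetically) 16-byte L1 blocks and 64-byte L2 blocks, so one L2 block covers exactly four L1 blocks. When L2 chooses a victim, inclusion requires invalidating every L1 block whose addresses fall inside the victim's range.

```python
# One L2 block covers L2_BLOCK // L1_BLOCK consecutive L1 blocks, so an
# L2 replacement may force that many L1 invalidations to preserve inclusion.

L1_BLOCK = 16   # bytes (assumed for illustration)
L2_BLOCK = 64   # bytes; one L2 block holds four L1 blocks

def l1_blocks_covered(addr):
    """L1 block addresses contained in the L2 block holding addr."""
    base = addr - (addr % L2_BLOCK)          # align down to the L2 block
    return [base + i * L1_BLOCK for i in range(L2_BLOCK // L1_BLOCK)]

# Replacing the L2 block that holds address 0x1234 forces up to four
# L1 invalidations:
print([hex(a) for a in l1_blocks_covered(0x1234)])
# ['0x1200', '0x1210', '0x1220', '0x1230']
```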
Example