Thread-Level Parallelism - Computer Architecture: A Quantitative Approach


is not an optimization. Rather, to ensure forward progress, protocol implementations must

ensure that they perform at least one CPU operation before relinquishing a block. Suppose

the coherence protocol implementation did not do this. Explain how this might lead to

livelock. Give a simple code example that could stimulate this behavior.
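
As an illustration of the forward-progress concern raised above, the following is a minimal toy model, not a real coherence protocol. It assumes two nodes that both want to store to the same block, with the other node's request always pending when the block arrives. The function name `run_ping_pong` and its parameters are hypothetical, introduced only for this sketch.

```python
def run_ping_pong(transfers, perform_op_before_release):
    """Toy model of two nodes contending for one cache block.

    Each iteration the block arrives at one node while the other node's
    request is already waiting. Returns the stores completed per node.
    """
    stores = [0, 0]
    holder = 0
    for _ in range(transfers):
        if perform_op_before_release:
            # The holder performs one store before giving up the block.
            stores[holder] += 1
        # Otherwise the block is relinquished before the store executes.
        holder = 1 - holder  # ownership moves to the other node
    return stores


# Without the guarantee the block bounces back and forth with no store
# ever completing; with it, every transfer makes progress.
print(run_ping_pong(1000, perform_op_before_release=False))
print(run_ping_pong(1000, perform_op_before_release=True))
```

In this model the block keeps moving, so the system is not deadlocked, yet neither node ever completes its store unless it is allowed one operation per acquisition.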

5.17 [20/30] <5.4> Some directory protocols add an Owned (O) state to the protocol, similar

to the optimization discussed for snooping protocols. The Owned state behaves like the

Shared state in that nodes may only read Owned blocks, but it behaves like the Modified

state in that nodes must supply data on other nodes' Get requests to Owned blocks. The

Owned state eliminates the case where a GetShared request to a block in state Modified

requires the node to send the data to both the requesting processor and the memory. In a

MOSI directory protocol, a GetShared request to a block in either the Modified or Owned

states supplies data to the requesting node and transitions to the Owned state. A GetModified

request in state Owned is handled like a request in state Modified. This optimized

MOSI protocol only updates memory when a node replaces a block in state Modified or

Owned.

a. [20] <5.4> Explain why the MS^A state in the protocol is essentially a “transient”

Owned state.

b. [30] <5.4> Modify the cache and directory protocol tables to support a stable Owned

state.
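
The owner-side behavior that exercise 5.17 describes for the optimized MOSI protocol can be sketched as a small transition function. This is a hedged illustration rather than the protocol tables the exercise asks for: it models only the stable Modified and Owned states, ignores transient states and sharer invalidations, and assumes that on a GetModified request the owner forwards the data and invalidates its own copy. The name `owner_response` and the event strings are hypothetical.

```python
# Stable states of the MOSI protocol described in the exercise.
MODIFIED, OWNED, SHARED, INVALID = "M", "O", "S", "I"


def owner_response(state, event):
    """Return (next_state, supplies_data, updates_memory) for the owning node.

    Only the Modified and Owned rows are modeled; transient states and
    sharer invalidations are deliberately left out of this sketch.
    """
    if state not in (MODIFIED, OWNED):
        raise ValueError(f"node does not own the block in state {state!r}")
    if event == "GetShared":
        # Data goes to the requester only; memory is not updated.
        return OWNED, True, False
    if event == "GetModified":
        # Assumption: the owner forwards the data and invalidates its copy,
        # identically for Modified and Owned.
        return INVALID, True, False
    if event == "Replace":
        # The only point at which memory is brought up to date.
        return INVALID, False, True
    raise ValueError(f"unknown event {event!r}")


for state in (MODIFIED, OWNED):
    for event in ("GetShared", "GetModified", "Replace"):
        print(state, event, owner_response(state, event))
```

The Modified and Owned rows come out identical, which reflects the statement that a request in state Owned is handled like a request in state Modified, and memory is written only on replacement.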

5.18 [25/25] <5.4> The advanced directory protocol described above relies on a point-to-point

ordered interconnect to ensure correct operation. Assuming the initial cache contents of

Figure 5.38 and the following sequences of operations, explain what problem could arise

if the interconnect failed to maintain point-to-point ordering. Assume that the processors

perform the requests at the same time, but they are processed by the directory in the order

shown.

a. [25] <5.4> P1,0: read 110

P3,1: write 110 <-- 90

b. [25] <5.4> P1,0: read 110

P0,0: replace 110

Exercises

5.19 [15] <5.1> Assume that we have a function for an application of the form F(i, p), which

gives the fraction of time that exactly i processors are usable given that a total of p

processors is available. That means that

$$\sum_{i=1}^{p} F(i, p) = 1$$

Assume that when i processors are in use, the applications run i times faster. Rewrite

Amdahl's law so it gives the speedup as a function of p for some application.
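
One way to read exercise 5.19 is that the time spent with exactly i processors usable shrinks by a factor of i, so the normalized execution time is the sum of F(i, p)/i over i from 1 to p, and the speedup is its reciprocal. The sketch below computes that quantity under this assumption. The function name `speedup` and the list encoding of F are hypothetical choices for illustration.

```python
def speedup(fractions):
    """Speedup on p processors, where fractions[i - 1] holds F(i, p).

    Assumes the application runs i times faster while exactly i
    processors are in use, as stated in the exercise.
    """
    if abs(sum(fractions) - 1.0) > 1e-9:
        raise ValueError("the fractions F(i, p) must sum to 1")
    # Normalized parallel time: each fraction is divided by its speed factor.
    parallel_time = sum(f / i for i, f in enumerate(fractions, start=1))
    return 1.0 / parallel_time


# Example with p = 4: 20% of the time only one processor is usable,
# 80% of the time all four are. The result is approximately 2.5.
print(speedup([0.2, 0.0, 0.0, 0.8]))
```

When all of the time is spent with a single usable processor the function returns 1, and when all of it is spent with p usable processors it returns p, matching the two extremes of Amdahl's law.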

5.20 [15/20/10] <5.1> In this exercise, we examine the effect of the interconnection network

topology on the clock cycles per instruction (CPI) of programs running on a 64-processor

distributed-memory multiprocessor. The processor clock rate is 3.3 GHz and the base CPI
