Hardware Reference
In-Depth Information
given in Figure 5.35. What is the resulting state (i.e., coherence state, tags, and data) of the
caches and memory after the given action? Show only the blocks that change; for example,
P0.B0: (I, 120, 00 01) indicates that CPU P0's block B0 has a final state of I, a tag of 120, and
data words 00 and 01. Also, what value is returned by each read operation?
a. [10] <5.2> P0: read 120
b. [10] <5.2> P0: write 120 <-- 80
c. [10] <5.2> P3: write 120 <-- 80
d. [10] <5.2> P1: read 110
e. [10] <5.2> P0: write 108 <-- 48
f. [10] <5.2> P0: write 130 <-- 78
g. [10] <5.2> P3: write 130 <-- 78
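The state notation above (I for Invalid, alongside the shared and modified/exclusive states discussed in the next exercise) follows an MSI-style snooping protocol. As an illustrative aid only, a minimal sketch of the MSI transitions for a single cache block might look like the following; the event names and table layout are assumptions, not the book's figures:

```python
# Minimal sketch of MSI coherence transitions for one cache block.
# States: 'I' (Invalid), 'S' (Shared), 'M' (Modified/exclusive).
# Events seen by this cache: 'cpu_read'/'cpu_write' from its own CPU,
# 'bus_read'/'bus_write' snooped from another CPU's bus transaction.

def msi_next(state, event):
    """Return the next MSI state for a (state, event) pair."""
    transitions = {
        ('I', 'cpu_read'):  'S',  # read miss: fetch block, enter Shared
        ('I', 'cpu_write'): 'M',  # write miss: fetch block exclusively
        ('S', 'cpu_read'):  'S',  # read hit: no change
        ('S', 'cpu_write'): 'M',  # write hit: invalidate other copies
        ('S', 'bus_read'):  'S',  # another reader: still Shared
        ('S', 'bus_write'): 'I',  # another writer: our copy invalidated
        ('M', 'cpu_read'):  'M',  # hit in Modified: no change
        ('M', 'cpu_write'): 'M',  # hit in Modified: no change
        ('M', 'bus_read'):  'S',  # supply data, downgrade to Shared
        ('M', 'bus_write'): 'I',  # supply data, then invalidate
    }
    return transitions[(state, event)]
```

Tracing each exercise operation through a table like this (together with the tags and data from Figure 5.35) yields the per-block answers requested above.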
5.2 [20/20/20/20] <5.3> The performance of a snooping cache-coherent multiprocessor depends on many detailed implementation issues that determine how quickly a cache responds with data in an exclusive or M state block. In some implementations, a CPU read
miss to a cache block that is exclusive in another processor's cache is faster than a miss to
a block in memory. This is because caches are smaller, and thus faster, than main memory.
Conversely, in some implementations, misses satisfied by memory are faster than those
satisfied by caches. This is because caches are generally optimized for “front side” or CPU
references, rather than “back side” or snooping accesses. For the multiprocessor illustrated
in Figure 5.35, consider the execution of a sequence of operations on a single CPU where
■ CPU read and write hits generate no stall cycles.
■ CPU read and write misses generate N memory and N cache stall cycles if satisfied by
memory and cache, respectively.
■ CPU write hits that generate an invalidate incur N invalidate stall cycles.
■ A write-back of a block, due to either a conflict or another processor's request to an
exclusive block, incurs an additional N writeback stall cycles.
Consider two implementations with different performance characteristics summarized in Figure 5.36. Consider the following sequence of operations assuming the initial cache state in Figure 5.35. For simplicity, assume that the second operation begins after the first completes (even though they are on different processors):
P1: read 110
P3: read 110
For Implementation 1, the first read generates 50 stall cycles because the read is satisfied by
P0's cache. P1 stalls for 40 cycles while it waits for the block, and P0 stalls for 10 cycles while
it writes the block back to memory in response to P1's request. Thus, the second read by P3
generates 100 stall cycles because its miss is satisfied by memory, and this sequence generates
a total of 150 stall cycles. For the following sequences of operations, how many stall cycles are
generated by each implementation?
a. [20] <5.3> P0: read 120
P0: read 128
P0: read 130
b. [20] <5.3> P0: read 100
P0: write 108 <-- 48
P0: write 130 <-- 78
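As a cross-check on the worked example above, the Implementation 1 accounting can be sketched in code. Only the three costs quoted in the text are used (100 cycles for a miss satisfied by memory, 40 for a miss satisfied by another cache, 10 for the owner's write-back); Figure 5.36's full parameter set, including the invalidate cost, is not reproduced here:

```python
# Stall-cycle accounting for Implementation 1, using only the costs
# quoted in the worked example (not the full table of Figure 5.36).

MEMORY_MISS = 100  # miss satisfied by memory
CACHE_MISS = 40    # miss satisfied by another processor's cache
WRITEBACK = 10     # owning cache writes its M-state block back

def stalls(miss_source, owner_writeback=False):
    """Stall cycles for one miss under Implementation 1."""
    cycles = MEMORY_MISS if miss_source == 'memory' else CACHE_MISS
    if owner_writeback:
        cycles += WRITEBACK
    return cycles

# P1: read 110 -- satisfied by P0's cache, which also writes the block back.
first = stalls('cache', owner_writeback=True)  # 40 + 10 = 50
# P3: read 110 -- the block is now in memory, so memory supplies it.
second = stalls('memory')                      # 100
total = first + second                         # 150, matching the text
```

Answering parts (a) and (b) is then a matter of classifying each reference (hit, miss to memory, miss to another cache, invalidate, write-back) against the initial state in Figure 5.35 and summing the corresponding costs for each implementation.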