Hardware Reference
In-Depth Information
given in Figure 5.35. What is the resulting state (i.e., coherence state, tags, and data) of the
caches and memory after the given action? Show only the blocks that change; for example,
P0.B0: (I, 120, 00 01) indicates that CPU P0's block B0 has a final state of I, a tag of 120, and
data words 00 and 01. Also, what value is returned by each read operation?
a. [10] <5.2> P0: read 120
b. [10] <5.2> P0: write 120 <-- 80
c. [10] <5.2> P3: write 120 <-- 80
d. [10] <5.2> P1: read 110
e. [10] <5.2> P0: write 108 <-- 48
f. [10] <5.2> P0: write 130 <-- 78
g. [10] <5.2> P3: write 130 <-- 78
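The state notation above (I for Invalid, alongside the shared and modified/exclusive states discussed in the next exercise) follows an MSI-style snooping protocol. As an illustrative aid only, a minimal sketch of the MSI transitions for a single cache block might look like the following; the event names and table layout are assumptions, not the book's figures:

```python
# Minimal sketch of MSI coherence transitions for one cache block.
# States: 'I' (Invalid), 'S' (Shared), 'M' (Modified/exclusive).
# Events seen by this cache: 'cpu_read'/'cpu_write' from its own CPU,
# 'bus_read'/'bus_write' snooped from another CPU's bus transaction.

def msi_next(state, event):
    """Return the next MSI state for a (state, event) pair."""
    transitions = {
        ('I', 'cpu_read'):  'S',  # read miss: fetch block, enter Shared
        ('I', 'cpu_write'): 'M',  # write miss: fetch block exclusively
        ('S', 'cpu_read'):  'S',  # read hit: no change
        ('S', 'cpu_write'): 'M',  # write hit: invalidate other copies
        ('S', 'bus_read'):  'S',  # another reader: still Shared
        ('S', 'bus_write'): 'I',  # another writer: our copy invalidated
        ('M', 'cpu_read'):  'M',  # hit in Modified: no change
        ('M', 'cpu_write'): 'M',  # hit in Modified: no change
        ('M', 'bus_read'):  'S',  # supply data, downgrade to Shared
        ('M', 'bus_write'): 'I',  # supply data, then invalidate
    }
    return transitions[(state, event)]
```

Tracing each exercise operation through a table like this (together with the tags and data from Figure 5.35) yields the per-block answers requested above.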
5.2 [20/20/20/20] <5.3> The performance of a snooping cache-coherent multiprocessor depends on many detailed implementation issues that determine how quickly a cache responds with data in an exclusive or M state block. In some implementations, a CPU read
miss to a cache block that is exclusive in another processor's cache is faster than a miss to
a block in memory. This is because caches are smaller, and thus faster, than main memory.
Conversely, in some implementations, misses satisfied by memory are faster than those
satisfied by caches. This is because caches are generally optimized for “front side” or CPU
references, rather than “back side” or snooping accesses. For the multiprocessor illustrated
in Figure 5.35, consider the execution of a sequence of operations on a single CPU where
■ CPU read and write hits generate no stall cycles.
■ CPU read and write misses generate N memory and N cache stall cycles if satisfied by
memory and cache, respectively.
■ CPU write hits that generate an invalidate incur N invalidate stall cycles.
■ A write-back of a block, due to either a conflict or another processor's request to an
exclusive block, incurs an additional N writeback stall cycles.
Consider two implementations with different performance characteristics summarized in Figure 5.36. Consider the following sequence of operations assuming the initial cache state in Figure 5.35. For simplicity, assume that the second operation begins after the first completes (even though they are on different processors):
P1: read 110
P3: read 110
For Implementation 1, the first read generates 50 stall cycles because the read is satisfied by
P0's cache. P1 stalls for 40 cycles while it waits for the block, and P0 stalls for 10 cycles while
it writes the block back to memory in response to P1's request. Thus, the second read by P3
generates 100 stall cycles because its miss is satisfied by memory, and this sequence generates
a total of 150 stall cycles. For the following sequences of operations, how many stall cycles are
generated by each implementation?
a. [20] <5.3> P0: read 120
P0: read 128
P0: read 130
b. [20] <5.3> P0: read 100
P0: write 108 <-- 48
P0: write 130 <-- 78
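As a cross-check on the worked example above, the Implementation 1 accounting can be sketched in code. Only the three costs quoted in the text are used (100 cycles for a miss satisfied by memory, 40 for a miss satisfied by another cache, 10 for the owner's write-back); Figure 5.36's full parameter set, including the invalidate cost, is not reproduced here:

```python
# Stall-cycle accounting for Implementation 1, using only the costs
# quoted in the worked example (not the full table of Figure 5.36).

MEMORY_MISS = 100  # miss satisfied by memory
CACHE_MISS = 40    # miss satisfied by another processor's cache
WRITEBACK = 10     # owning cache writes its M-state block back

def stalls(miss_source, owner_writeback=False):
    """Stall cycles for one miss under Implementation 1."""
    cycles = MEMORY_MISS if miss_source == 'memory' else CACHE_MISS
    if owner_writeback:
        cycles += WRITEBACK
    return cycles

# P1: read 110 -- satisfied by P0's cache, which also writes the block back.
first = stalls('cache', owner_writeback=True)  # 40 + 10 = 50
# P3: read 110 -- the block is now in memory, so memory supplies it.
second = stalls('memory')                      # 100
total = first + second                         # 150, matching the text
```

Answering parts (a) and (b) is then a matter of classifying each reference (hit, miss to memory, miss to another cache, invalidate, write-back) against the initial state in Figure 5.35 and summing the corresponding costs for each implementation.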