of an application with all references hitting in the cache is 0.5. Assume that 0.2% of the instructions involve a remote communication reference. The cost of a remote communication reference is (100 + 10h) ns, where h is the number of communication network hops that a remote reference has to make to the remote processor memory and back. Assume that all communication links are bidirectional.
a. [15] <5.1> Calculate the worst-case remote communication cost when the 64 processors are arranged as a ring, as an 8×8 processor grid, or as a hypercube. (Hint: The longest communication path on a 2^n hypercube has n links.)
b. [20] <5.1> Compare the base CPI of the application with no remote communication to
the CPI achieved with each of the three topologies in part (a).
c. [10] <5.1> How much faster is the application with no remote communication compared to its performance with remote communication on each of the three topologies in part (a)?
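
The arithmetic for parts (a) through (c) can be checked with a few lines of C. A minimal sketch follows; the 64-processor count, 0.5 base CPI, 0.2% remote-reference rate, and (100 + 10h) ns cost are from the exercise, while the 1 GHz clock rate is an assumption made here only so that nanoseconds convert one-for-one to cycles. Note that h counts the hops to the remote memory and back, so the one-way worst-case distance is doubled.

    #include <stdio.h>

    int main(void) {
        const double base_cpi    = 0.5;
        const double remote_rate = 0.002;   /* 0.2% of instructions */
        const double clock_ghz   = 1.0;     /* ASSUMED clock rate */

        /* Worst-case one-way hop counts for 64 processors. */
        struct { const char *name; int one_way; } topo[] = {
            { "ring (64 nodes)", 32 },  /* halfway around a bidirectional ring */
            { "8x8 grid",        14 },  /* corner to corner: 7 + 7 hops */
            { "hypercube (2^6)",  6 },  /* n links on a 2^n hypercube */
        };

        for (int i = 0; i < 3; i++) {
            int    h           = 2 * topo[i].one_way;   /* round trip */
            double cost_ns     = 100.0 + 10.0 * h;      /* (100 + 10h) ns */
            double cost_cycles = cost_ns * clock_ghz;   /* 1 ns = 1 cycle at 1 GHz */
            double cpi         = base_cpi + remote_rate * cost_cycles;
            printf("%-16s h=%2d  cost=%4.0f ns  CPI=%.2f  slowdown=%.2fx\n",
                   topo[i].name, h, cost_ns, cpi, cpi / base_cpi);
        }
        return 0;
    }

Under the assumed clock rate this gives CPIs of 1.98 (ring), 1.26 (grid), and 0.94 (hypercube), and the speedup asked for in part (c) is simply the ratio of each CPI to the 0.5 base.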
5.21 [15] <5.2> Show how the basic snooping protocol of Figure 5.7 can be changed for a
write-through cache. What is the major hardware functionality that is not needed with a
write-through cache compared with a write-back cache?
5.22 [20] <5.2> Add a clean exclusive state to the basic snooping cache coherence protocol (Figure 5.7). Show the protocol in the format of Figure 5.7.
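
The clean exclusive state asked for here is essentially the Exclusive state of MESI. The fragment below is not the requested Figure 5.7-style table, just a minimal C sketch of the transitions that the new state changes, assuming MESI-like semantics: a write hit in clean exclusive upgrades silently, since no other cache can hold the block, while a snooped read demotes the private states to Shared. All names are illustrative.

    #include <stdbool.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } line_state_t;

    /* Processor write hit: returns true if a bus invalidate must be sent. */
    bool on_cpu_write(line_state_t *s) {
        switch (*s) {
        case MODIFIED:            /* already owned dirty: no bus action */
            return false;
        case EXCLUSIVE:           /* clean but private: silent upgrade */
            *s = MODIFIED;
            return false;
        default:                  /* SHARED: other caches may hold it */
            *s = MODIFIED;
            return true;
        }
    }

    /* Snooped read from another processor: private states drop to Shared
       (Modified must also supply or write back the block). */
    void on_bus_read(line_state_t *s) {
        if (*s == EXCLUSIVE || *s == MODIFIED)
            *s = SHARED;
    }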
5.23 [15] <5.2> One proposed solution for the problem of false sharing is to add a valid bit per word. This would allow the protocol to invalidate a word without removing the entire block, letting a processor keep a portion of a block in its cache while another processor writes a different portion of the block. What extra complications are introduced into the basic snooping cache coherence protocol (Figure 5.7) if this capability is included? Remember to consider all possible protocol actions.
5.24 [15/10] <5.3> This exercise studies the impact of aggressive techniques to exploit instruction-level parallelism in the processor when used in the design of shared-memory multiprocessor systems. Consider two systems identical except for the processor. System A uses a processor with a simple single-issue in-order pipeline, while system B uses a processor with four-way issue, out-of-order execution, and a reorder buffer with 64 entries.
a. [15] <5.3> Following the convention of Figure 5.11, let us divide the execution time into instruction execution, cache access, memory access, and other stalls. How would you expect each of these components to differ between system A and system B?
b. [10] <5.3> Based on the discussion of the behavior of the On-Line Transaction Processing (OLTP) workload in Section 5.3, what is the important difference between the OLTP workload and other benchmarks that limits benefit from a more aggressive processor design?
5.25 [15] <5.3> How would you change the code of an application to avoid false sharing?
What might be done by a compiler and what might require programmer directives?
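
As a concrete illustration of one common fix, the C sketch below pads and aligns each thread's counter to its own cache line so that writes by different threads never contend for the same line. The 64-byte line size is an assumption; a compiler can sometimes insert such padding or reorder fields on its own, but alignment directives like alignas are typically supplied by the programmer.

    #include <stdalign.h>

    #define NUM_THREADS 4
    #define CACHE_LINE  64               /* ASSUMED line size in bytes */

    /* Prone to false sharing: adjacent counters share a cache line, so
       a write by any thread invalidates the line in every other cache. */
    long packed_counts[NUM_THREADS];

    /* Fixed: each counter is aligned to its own line (sizeof rounds up
       to CACHE_LINE), so each thread writes a line no one else touches. */
    struct counter_padded {
        alignas(CACHE_LINE) long count;
    };
    struct counter_padded per_thread[NUM_THREADS];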
5.26 [15] <5.4> Assume a directory-based cache coherence protocol. The directory currently
has information that indicates that processor P1 has the data in “exclusive” mode. If the
directory now gets a request for the same cache block from processor P1, what could this
mean? What should the directory controller do? (Such cases are called race conditions and
are the reason why coherence protocols are so difficult to design and verify.)
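
A minimal sketch of the check involved, assuming a simple directory with an owner field: a request from the recorded exclusive owner can only mean that P1 has replaced the block and its writeback (or replacement notice) is still in flight, so the controller must not forward the request back to P1. One reasonable policy, shown below, is to defer or NACK the request until the writeback arrives and updates the entry; all names here are illustrative, not from the text.

    #include <stdbool.h>

    typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

    typedef struct {
        dir_state_t state;
        int         owner;               /* valid when DIR_EXCLUSIVE */
    } dir_entry_t;

    /* Returns true if the request can be serviced now, false if it must
       wait for an in-flight writeback from the requester to drain. */
    bool handle_request(dir_entry_t *e, int requester) {
        if (e->state == DIR_EXCLUSIVE && e->owner == requester) {
            /* Race: the owner no longer holds the block even though the
               directory still thinks it does. Stall or NACK, then retry
               after the writeback updates this entry. */
            return false;
        }
        /* ... normal read/write miss handling would go here ... */
        return true;
    }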
5.27 [20] <5.4> A directory controller can send invalidates for lines that have been replaced by
the local cache controller. To avoid such messages and to keep the directory consistent, re-