of an application with all references hitting in the cache is 0.5. Assume that 0.2% of the instructions involve a remote communication reference. The cost of a remote communication reference is (100 + 10h) ns, where h is the number of communication network hops that a remote reference has to make to the remote processor memory and back. Assume that all communication links are bidirectional.
a. [15] <5.1> Calculate the worst-case remote communication cost when the 64 processors are arranged as a ring, as an 8×8 processor grid, or as a hypercube. (Hint: The longest communication path on a 2^n hypercube has n links.)
b. [20] <5.1> Compare the base CPI of the application with no remote communication to
the CPI achieved with each of the three topologies in part (a).
c. [10] <5.1> How much faster is the application with no remote communication compared to its performance with remote communication on each of the three topologies in part (a)?
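
The arithmetic for parts (a) through (c) can be checked with a few lines of C. A minimal sketch follows; the 64-processor count, 0.5 base CPI, 0.2% remote-reference rate, and (100 + 10h) ns cost are from the exercise, while the 1 GHz clock rate is an assumption made here only so that nanoseconds convert one-for-one to cycles. Note that h counts the hops to the remote memory and back, so the one-way worst-case distance is doubled.

    #include <stdio.h>

    int main(void) {
        const double base_cpi    = 0.5;
        const double remote_rate = 0.002;   /* 0.2% of instructions */
        const double clock_ghz   = 1.0;     /* ASSUMED clock rate */

        /* Worst-case one-way hop counts for 64 processors. */
        struct { const char *name; int one_way; } topo[] = {
            { "ring (64 nodes)", 32 },  /* halfway around a bidirectional ring */
            { "8x8 grid",        14 },  /* corner to corner: 7 + 7 hops */
            { "hypercube (2^6)",  6 },  /* n links on a 2^n hypercube */
        };

        for (int i = 0; i < 3; i++) {
            int    h           = 2 * topo[i].one_way;   /* round trip */
            double cost_ns     = 100.0 + 10.0 * h;      /* (100 + 10h) ns */
            double cost_cycles = cost_ns * clock_ghz;   /* 1 ns = 1 cycle at 1 GHz */
            double cpi         = base_cpi + remote_rate * cost_cycles;
            printf("%-16s h=%2d  cost=%4.0f ns  CPI=%.2f  slowdown=%.2fx\n",
                   topo[i].name, h, cost_ns, cpi, cpi / base_cpi);
        }
        return 0;
    }

Under the assumed clock rate this gives CPIs of 1.98 (ring), 1.26 (grid), and 0.94 (hypercube), and the speedup asked for in part (c) is simply the ratio of each CPI to the 0.5 base.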
5.21 [15] <5.2> Show how the basic snooping protocol of Figure 5.7 can be changed for a
write-through cache. What is the major hardware functionality that is not needed with a
write-through cache compared with a write-back cache?
5.22 [20] <5.2> Add a clean exclusive state to the basic snooping cache coherence protocol (Figure 5.7). Show the protocol in the format of Figure 5.7.
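
The clean exclusive state asked for here is essentially the Exclusive state of MESI. The fragment below is not the requested Figure 5.7-style table, just a minimal C sketch of the transitions that the new state changes, assuming MESI-like semantics: a write hit in clean exclusive upgrades silently, since no other cache can hold the block, while a snooped read demotes the private states to Shared. All names are illustrative.

    #include <stdbool.h>

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } line_state_t;

    /* Processor write hit: returns true if a bus invalidate must be sent. */
    bool on_cpu_write(line_state_t *s) {
        switch (*s) {
        case MODIFIED:            /* already owned dirty: no bus action */
            return false;
        case EXCLUSIVE:           /* clean but private: silent upgrade */
            *s = MODIFIED;
            return false;
        default:                  /* SHARED: other caches may hold it */
            *s = MODIFIED;
            return true;
        }
    }

    /* Snooped read from another processor: private states drop to Shared
       (Modified must also supply or write back the block). */
    void on_bus_read(line_state_t *s) {
        if (*s == EXCLUSIVE || *s == MODIFIED)
            *s = SHARED;
    }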
5.23 [15] <5.2> One proposed solution for the problem of false sharing is to add a valid bit per word. This would allow the protocol to invalidate a word without removing the entire block, letting a processor keep a portion of a block in its cache while another processor writes a different portion of the block. What extra complications are introduced into the basic snooping cache coherence protocol (Figure 5.7) if this capability is included? Remember to consider all possible protocol actions.
5.24 [15/10] <5.3> This exercise studies the impact of aggressive techniques to exploit instruction-level parallelism in the processor when used in the design of shared-memory multiprocessor systems. Consider two systems identical except for the processor. System A uses a processor with a simple single-issue in-order pipeline, while system B uses a processor with four-way issue, out-of-order execution, and a reorder buffer with 64 entries.
a. [15] <5.3> Following the convention of Figure 5.11, let us divide the execution time into instruction execution, cache access, memory access, and other stalls. How would you expect each of these components to differ between system A and system B?
b. [10] <5.3> Based on the discussion of the behavior of the On-Line Transaction Processing (OLTP) workload in Section 5.3, what is the important difference between the OLTP workload and other benchmarks that limits benefit from a more aggressive processor design?
5.25 [15] <5.3> How would you change the code of an application to avoid false sharing?
What might be done by a compiler and what might require programmer directives?
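
As a concrete illustration of one common fix, the C sketch below pads and aligns each thread's counter to its own cache line so that writes by different threads never contend for the same line. The 64-byte line size is an assumption; a compiler can sometimes insert such padding or reorder fields on its own, but alignment directives like alignas are typically supplied by the programmer.

    #include <stdalign.h>

    #define NUM_THREADS 4
    #define CACHE_LINE  64               /* ASSUMED line size in bytes */

    /* Prone to false sharing: adjacent counters share a cache line, so
       a write by any thread invalidates the line in every other cache. */
    long packed_counts[NUM_THREADS];

    /* Fixed: each counter is aligned to its own line (sizeof rounds up
       to CACHE_LINE), so each thread writes a line no one else touches. */
    struct counter_padded {
        alignas(CACHE_LINE) long count;
    };
    struct counter_padded per_thread[NUM_THREADS];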
5.26 [15] <5.4> Assume a directory-based cache coherence protocol. The directory currently
has information that indicates that processor P1 has the data in “exclusive” mode. If the
directory now gets a request for the same cache block from processor P1, what could this
mean? What should the directory controller do? (Such cases are called race conditions and
are the reason why coherence protocols are so difficult to design and verify.)
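
A minimal sketch of the check involved, assuming a simple directory with an owner field: a request from the recorded exclusive owner can only mean that P1 has replaced the block and its writeback (or replacement notice) is still in flight, so the controller must not forward the request back to P1. One reasonable policy, shown below, is to defer or NACK the request until the writeback arrives and updates the entry; all names here are illustrative, not from the text.

    #include <stdbool.h>

    typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

    typedef struct {
        dir_state_t state;
        int         owner;               /* valid when DIR_EXCLUSIVE */
    } dir_entry_t;

    /* Returns true if the request can be serviced now, false if it must
       wait for an in-flight writeback from the requester to drain. */
    bool handle_request(dir_entry_t *e, int requester) {
        if (e->state == DIR_EXCLUSIVE && e->owner == requester) {
            /* Race: the owner no longer holds the block even though the
               directory still thinks it does. Stall or NACK, then retry
               after the writeback updates this entry. */
            return false;
        }
        /* ... normal read/write miss handling would go here ... */
        return true;
    }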
5.27 [20] <5.4> A directory controller can send invalidates for lines that have been replaced by
the local cache controller. To avoid such messages and to keep the directory consistent, re-