FIGURE 5.19 The number of bytes needed per data reference grows as block size is increased for both the kernel and user components. It is interesting to compare this chart against the data on scientific programs shown in Appendix I.
For the multiprogrammed workload, the OS is a much more demanding user of the memory
system. If more OS or OS-like activity is included in the workload, and its behavior is similar
to what was measured for this workload, it will become very difficult to build a sufficiently
capable memory system. One possible route to improving performance is to make the OS
more cache aware, either through better programming environments or through programmer
assistance. For example, the OS reuses memory for requests that arise from different system
calls. Despite the fact that the reused memory will be completely overwritten, the hardware,
not recognizing this, will attempt to preserve coherency and the possibility that some portion
of a cache block may be read, even if it is not. This behavior is analogous to the reuse of stack
locations on procedure invocations. The IBM Power series has support to allow the compiler
to indicate this type of behavior on procedure invocations, and the newest AMD processors
have similar support. It is harder to detect such behavior by the OS, and doing so may require
programmer assistance, but the payoff is potentially even greater.
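To make the idea concrete, the following minimal C sketch (not from the text) illustrates the pattern described above: a per-call buffer that is completely overwritten on every system call, together with a software hint that lets the hardware allocate and zero each cache block instead of fetching its stale contents. The buffer size, the block size, and the CACHE_BLOCK_ZERO macro are illustrative assumptions; on PowerPC the hint can be expressed with the dcbz (data cache block zero) instruction, and the portable fallback is an ordinary memset.

#include <stddef.h>
#include <string.h>

#define CACHE_BLOCK 128   /* assumed coherence block size */
#define BUF_SIZE    4096  /* scratch buffer reused by every system call */

/* Buffer aligned to a cache block so the loop below covers whole blocks. */
static char scratch[BUF_SIZE] __attribute__((aligned(CACHE_BLOCK)));

#if defined(__powerpc__) || defined(__powerpc64__)
/* PowerPC: zero the cache block containing the address without reading
 * its old contents from memory, avoiding a coherence fetch of dead data. */
#define CACHE_BLOCK_ZERO(p) __asm__ volatile ("dcbz 0,%0" : : "r"(p) : "memory")
#else
/* Portable fallback: a plain overwrite; the hardware may still fetch
 * the old contents of each block before they are overwritten. */
#define CACHE_BLOCK_ZERO(p) memset((p), 0, CACHE_BLOCK)
#endif

/* Called at the start of each system call: the previous contents of
 * scratch[] are never read, so fetching them into the cache is wasted
 * memory traffic that the hint above avoids. */
void prepare_scratch(void)
{
    for (size_t off = 0; off < BUF_SIZE; off += CACHE_BLOCK)
        CACHE_BLOCK_ZERO(&scratch[off]);
    /* ... the system call then fills scratch[] with fresh data ... */
}

The same observation underlies the compiler support mentioned above for stack frames: the old contents of the reused block are known to be dead, so coherence traffic to fetch them is pure overhead.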
OS and commercial workloads pose tough challenges for multiprocessor memory systems,
and unlike scientific applications, which we examine in Appendix I, they are less amenable to
algorithmic or compiler restructuring. As the number of cores increases, predicting the behavior
of such applications is likely to get more difficult. Emulation or simulation methodologies
that allow the simulation of hundreds of cores with large applications (including operating
systems) will be crucial to maintaining an analytical and quantitative approach to design.
5.4 Distributed Shared-Memory and Directory-Based Coherence
As we saw in Section 5.2, a snooping protocol requires communication with all caches on
every cache miss, including writes of potentially shared data. The absence of any centralized
data structure that tracks the state of the caches is both the fundamental advantage of a
snooping-based scheme, because it keeps such a scheme inexpensive, and its Achilles' heel
when it comes to scalability.
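To see why this limits scalability, consider a rough calculation with assumed numbers (not taken from the text): if each of p processors generates about 10 million misses per second and every miss must be broadcast, each cache controller has to snoop roughly p x 10 million requests per second. With 4 processors that is 40 million snoops per second; with 64 processors it is 640 million. The demand on every cache's snoop bandwidth thus grows linearly with the processor count, while the bandwidth of an individual cache controller and of a shared interconnect does not.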
 
 