FIGURE 5.19 The number of bytes needed per data reference grows as block size is increased for both the kernel and user components. It is interesting to compare this chart against the data on scientific programs shown in Appendix I.
For the multiprogrammed workload, the OS is a much more demanding user of the memory
system. If more OS or OS-like activity is included in the workload, and its behavior is similar
to what was measured for this workload, it will become very difficult to build a sufficiently
capable memory system. One possible route to improving performance is to make the OS
more cache aware, either through better programming environments or through programmer
assistance. For example, the OS reuses memory for requests that arise from different system
calls. Despite the fact that the reused memory will be completely overwritten, the hardware,
not recognizing this, will attempt to preserve coherency and the possibility that some portion
of a cache block may be read, even if it is not. This behavior is analogous to the reuse of stack
locations on procedure invocations. The IBM Power series has support to allow the compiler
to indicate this type of behavior on procedure invocations, and the newest AMD processors
have similar support. It is harder to detect such behavior by the OS, and doing so may require
programmer assistance, but the payoff is potentially even greater.
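To make the idea concrete, the following minimal C sketch (not from the text) illustrates the pattern described above: a per-call buffer that is completely overwritten on every system call, together with a software hint that lets the hardware allocate and zero each cache block instead of fetching its stale contents. The buffer size, the block size, and the CACHE_BLOCK_ZERO macro are illustrative assumptions; on PowerPC the hint can be expressed with the dcbz (data cache block zero) instruction, and the portable fallback is an ordinary memset.

#include <stddef.h>
#include <string.h>

#define CACHE_BLOCK 128   /* assumed coherence block size */
#define BUF_SIZE    4096  /* scratch buffer reused by every system call */

/* Buffer aligned to a cache block so the loop below covers whole blocks. */
static char scratch[BUF_SIZE] __attribute__((aligned(CACHE_BLOCK)));

#if defined(__powerpc__) || defined(__powerpc64__)
/* PowerPC: zero the cache block containing the address without reading
 * its old contents from memory, avoiding a coherence fetch of dead data. */
#define CACHE_BLOCK_ZERO(p) __asm__ volatile ("dcbz 0,%0" : : "r"(p) : "memory")
#else
/* Portable fallback: a plain overwrite; the hardware may still fetch
 * the old contents of each block before they are overwritten. */
#define CACHE_BLOCK_ZERO(p) memset((p), 0, CACHE_BLOCK)
#endif

/* Called at the start of each system call: the previous contents of
 * scratch[] are never read, so fetching them into the cache is wasted
 * memory traffic that the hint above avoids. */
void prepare_scratch(void)
{
    for (size_t off = 0; off < BUF_SIZE; off += CACHE_BLOCK)
        CACHE_BLOCK_ZERO(&scratch[off]);
    /* ... the system call then fills scratch[] with fresh data ... */
}

The same observation underlies the compiler support mentioned above for stack frames: the old contents of the reused block are known to be dead, so coherence traffic to fetch them is pure overhead.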
OS and commercial workloads pose tough challenges for multiprocessor memory systems,
and unlike scientific applications, which we examine in Appendix I, they are less amenable to
algorithmic or compiler restructuring. As the number of cores increases, predicting the behavior
of such applications is likely to get more difficult. Emulation or simulation methodologies
that allow the simulation of hundreds of cores with large applications (including operating
systems) will be crucial to maintaining an analytical and quantitative approach to design.
5.4 Distributed Shared-Memory and Directory-Based Coherence
As we saw in Section 5.2, a snooping protocol requires communication with all caches on
every cache miss, including writes of potentially shared data. The absence of any centralized
data structure that tracks the state of the caches is both the fundamental advantage of a
snooping-based scheme, because it keeps such a scheme inexpensive, and its Achilles' heel
when it comes to scalability.
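To see why this limits scalability, consider a rough calculation with assumed numbers (not taken from the text): if each of p processors generates about 10 million misses per second and every miss must be broadcast, each cache controller has to snoop roughly p x 10 million requests per second. With 4 processors that is 40 million snoops per second; with 64 processors it is 640 million. The demand on every cache's snoop bandwidth thus grows linearly with the processor count, while the bandwidth of an individual cache controller and of a shared interconnect does not.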
 
 