■ Allowing more outstanding misses at the lowest level of the cache (where the miss time is the longest) requires supporting at least that many misses at a higher level, because a miss must initiate at the highest-level cache
■ The latency of the memory system
The following simplified example shows the key idea.
Example
Assume a main memory access time of 36 ns and a memory system capable of a sustained transfer rate of 16 GB/sec. If the block size is 64 bytes, what is the maximum number of outstanding misses we need to support, assuming that we can maintain the peak bandwidth given the request stream and that accesses never conflict? If the probability of a reference colliding with one of the previous four is 50%, and we assume that the access has to wait until the earlier access completes, estimate the maximum number of outstanding references. For simplicity, ignore the time between misses.
Answer
In the first case, assuming that we can maintain the peak bandwidth, the memory system can support (16 × 10⁹)/64 = 250 million references per second. Since each reference takes 36 ns, we can support 250 × 10⁶ × 36 × 10⁻⁹ = 9 references. If the probability of a collision is greater than 0, then we need more outstanding references, since we cannot start work on those references; the memory system needs more independent references, not fewer! To approximate this, we can simply assume that half the memory references cannot be issued to the memory because they are waiting on an earlier access. This means that we must support twice as many outstanding references, or 18.
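The arithmetic above is, in effect, an application of Little's law: the concurrency needed to sustain a given throughput equals throughput times latency. A minimal sketch in Python, using only the figures given in the example:

# Sketch of the outstanding-miss calculation above (concurrency =
# throughput x latency). All figures come from the example, not from
# a measured system.

MEM_LATENCY_S = 36e-9       # main memory access time: 36 ns
BANDWIDTH_B_PER_S = 16e9    # sustained transfer rate: 16 GB/sec
BLOCK_SIZE_B = 64           # cache block size: 64 bytes

# Peak request rate the memory system can sustain.
refs_per_sec = BANDWIDTH_B_PER_S / BLOCK_SIZE_B   # 250 million/sec

# Outstanding references needed to keep memory busy with no conflicts.
outstanding = refs_per_sec * MEM_LATENCY_S        # 9

# With a 50% chance of colliding with an earlier access, roughly half
# the outstanding references are stalled at any time, so we double the
# count (the same approximation the example makes).
outstanding_with_conflicts = 2 * outstanding      # 18

print(f"no conflicts: {outstanding:.0f}, "
      f"with conflicts: {outstanding_with_conflicts:.0f}")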
In Li, Chen, Brockman, and Jouppi's study, they found that the reduction in CPI for the integer programs was about 7% for one hit under miss and about 12.7% for 64 hits under miss. For the floating-point programs, the reductions were 12.7% for one hit under miss and 17.8% for 64. These reductions track fairly closely the reductions in the data cache access latency shown in Figure 2.5.
Fifth Optimization: Multibanked Caches to Increase Cache Bandwidth
Rather than treat the cache as a single monolithic block, we can divide it into independent banks that can support simultaneous accesses. Banks were originally used to improve performance of main memory and are now used inside modern DRAM chips as well as with caches. The Arm Cortex-A8 supports one to four banks in its L2 cache; the Intel Core i7 has four banks in L1 (to support up to 2 memory accesses per clock), and the L2 has eight banks.
Clearly, banking works best when the accesses naturally spread themselves across the banks, so the mapping of addresses to banks affects the behavior of the memory system. A simple mapping that works well is to spread the addresses of the block sequentially across the banks, called sequential interleaving.
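A minimal sketch of sequential interleaving, assuming the mapping described above: block addresses are assigned round-robin, so the bank index is simply the block address modulo the number of banks. The block size and bank count below are illustrative assumptions, not the parameters of any particular processor.

# Sequential interleaving sketch: consecutive cache blocks are spread
# round-robin across the banks. Parameters are assumed for illustration.

BLOCK_SIZE_B = 64   # bytes per cache block (assumed)
NUM_BANKS = 4       # e.g., a four-banked L2 (assumed)

def bank_of(addr: int) -> int:
    """Return the bank index that serves the block containing addr."""
    block_addr = addr // BLOCK_SIZE_B
    return block_addr % NUM_BANKS

# Consecutive blocks land in different banks, so a streaming access
# pattern keeps all four banks busy at once.
for addr in range(0, 8 * BLOCK_SIZE_B, BLOCK_SIZE_B):
    print(hex(addr), "-> bank", bank_of(addr))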