■ Allowing more outstanding misses at the lowest level of the cache (where the miss time is the longest) requires supporting at least that many misses at a higher level, because a miss must initiate at the highest-level cache
■ The latency of the memory system
The following simplified example shows the key idea.
Example
Assume a main memory access time of 36 ns and a memory system capable of a sustained transfer rate of 16 GB/sec. If the block size is 64 bytes, what is the maximum number of outstanding misses we need to support, assuming that we can maintain the peak bandwidth given the request stream and that accesses never conflict? If the probability of a reference colliding with one of the previous four is 50%, and we assume that the access has to wait until the earlier access completes, estimate the maximum number of outstanding references. For simplicity, ignore the time between misses.
Answer
In the first case, assuming that we can maintain the peak bandwidth, the memory system can support (16 × 10⁹)/64 = 250 million references per second. Since each reference takes 36 ns, we can support 250 × 10⁶ × 36 × 10⁻⁹ = 9 references. If the probability of a collision is greater than 0, then we need more outstanding references, since we cannot start work on those references; the memory system needs more independent references, not fewer! To approximate this, we can simply assume that half the memory references cannot be issued to the memory because they are waiting on an earlier access. This means that we must support twice as many outstanding references, or 18.
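The arithmetic above is, in effect, an application of Little's law: the concurrency needed to sustain a given throughput equals throughput times latency. A minimal sketch in Python, using only the figures given in the example:

# Sketch of the outstanding-miss calculation above (concurrency =
# throughput x latency). All figures come from the example, not from
# a measured system.

MEM_LATENCY_S = 36e-9       # main memory access time: 36 ns
BANDWIDTH_B_PER_S = 16e9    # sustained transfer rate: 16 GB/sec
BLOCK_SIZE_B = 64           # cache block size: 64 bytes

# Peak request rate the memory system can sustain.
refs_per_sec = BANDWIDTH_B_PER_S / BLOCK_SIZE_B   # 250 million/sec

# Outstanding references needed to keep memory busy with no conflicts.
outstanding = refs_per_sec * MEM_LATENCY_S        # 9

# With a 50% chance of colliding with an earlier access, roughly half
# the outstanding references are stalled at any time, so we double the
# count (the same approximation the example makes).
outstanding_with_conflicts = 2 * outstanding      # 18

print(f"no conflicts: {outstanding:.0f}, "
      f"with conflicts: {outstanding_with_conflicts:.0f}")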
In Li, Chen, Brockman, and Jouppi's study, they found that the reduction in CPI for the integer programs was about 7% for one hit under miss and about 12.7% for 64 hits under miss. For the floating-point programs, the reductions were 12.7% for one hit under miss and 17.8% for 64. These reductions track fairly closely the reductions in the data cache access latency shown in Figure 2.5.
Fifth Optimization: Multibanked Caches to Increase Cache Bandwidth
Rather than treat the cache as a single monolithic block, we can divide it into independent banks that can support simultaneous accesses. Banks were originally used to improve performance of main memory and are now used inside modern DRAM chips as well as with caches. The Arm Cortex-A8 supports one to four banks in its L2 cache; the Intel Core i7 has four banks in L1 (to support up to 2 memory accesses per clock), and the L2 has eight banks.
Clearly, banking works best when the accesses naturally spread themselves across the banks, so the mapping of addresses to banks affects the behavior of the memory system. A simple mapping that works well is to spread the addresses of the block sequentially across the banks, called sequential interleaving.
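A minimal sketch of sequential interleaving, assuming the mapping described above: block addresses are assigned round-robin, so the bank index is simply the block address modulo the number of banks. The block size and bank count below are illustrative assumptions, not the parameters of any particular processor.

# Sequential interleaving sketch: consecutive cache blocks are spread
# round-robin across the banks. Parameters are assumed for illustration.

BLOCK_SIZE_B = 64   # bytes per cache block (assumed)
NUM_BANKS = 4       # e.g., a four-banked L2 (assumed)

def bank_of(addr: int) -> int:
    """Return the bank index that serves the block containing addr."""
    block_addr = addr // BLOCK_SIZE_B
    return block_addr % NUM_BANKS

# Consecutive blocks land in different banks, so a streaming access
# pattern keeps all four banks busy at once.
for addr in range(0, 8 * BLOCK_SIZE_B, BLOCK_SIZE_B):
    print(hex(addr), "-> bank", bank_of(addr))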