Least recently used (LRU)—To reduce the chance of throwing out information that will be
needed soon, accesses to blocks are recorded. Relying on the past to predict the future, the
block replaced is the one that has been unused for the longest time. LRU relies on a corol-
lary of locality: If recently used blocks are likely to be used again, then a good candidate
for disposal is the least recently used block.
First in, first out (FIFO)—Because LRU can be complicated to calculate, FIFO approximates
LRU by replacing the oldest block rather than the least recently used one.
A virtue of random replacement is that it is simple to build in hardware. As the number
of blocks to keep track of increases, LRU becomes increasingly expensive and is usually only
approximated. A common approximation (often called pseudo-LRU) has a set of bits for each
set in the cache with each bit corresponding to a single way (a way is bank in a set associat-
ive cache; there are four ways in four-way set associative cache) in the cache. When a set is
accessed, the bit corresponding to the way containing the desired block is turned on; if all the
bits associated with a set are turned on, they are reset with the exception of the most recently
turned on bit. When a block must be replaced, the processor chooses a block from a way
whose bit is turned off, often randomly if more than one choice is available. This approximates
LRU, since the block that is replaced will not have been accessed since the last time that all the
blocks in the set were accessed. Figure B.4 shows the difference in miss rates between LRU,
random, and FIFO replacement.
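The bit-per-way scheme above can be sketched as follows for a single set of a four-way set-associative cache; the class and method names are illustrative, not from the text.

```python
import random

class PseudoLRUSet:
    """One set of a 4-way set-associative cache using the bit-per-way
    pseudo-LRU approximation: one 'recently used' bit per way."""

    def __init__(self, ways=4):
        self.ways = ways
        self.used = [False] * ways  # one bit per way, all initially off

    def access(self, way):
        """Turn on the bit for the accessed way; if all bits are now on,
        reset them all except the most recently turned-on bit."""
        self.used[way] = True
        if all(self.used):
            self.used = [False] * self.ways
            self.used[way] = True

    def victim(self):
        """Pick a way to replace: any way whose bit is off,
        chosen at random when more than one is available."""
        candidates = [w for w in range(self.ways) if not self.used[w]]
        return random.choice(candidates)
```

After all four ways have been touched, only the most recently used way keeps its bit on, so the victim is guaranteed not to have been accessed since the last time the whole set was accessed, which is exactly the approximation the text describes.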
FIGURE B.4 Data cache misses per 1000 instructions comparing least recently used,
random, and first in, first out replacement for several sizes and associativities. There is
little difference between LRU and random for the largest size cache, with LRU outperforming
the others for smaller caches. FIFO generally outperforms random in the smaller cache sizes.
These data were collected for a block size of 64 bytes for the Alpha architecture using 10
SPEC2000 benchmarks. Five are from SPECint2000 (gap, gcc, gzip, mcf, and perl) and five
are from SPECfp2000 (applu, art, equake, lucas, and swim). We will use this computer and
these benchmarks in most figures in this appendix.
Q4: What Happens on a Write?
Reads dominate processor cache accesses. All instruction accesses are reads, and most instruc-
tions don't write to memory. Figures A.32 and A.33 in Appendix A suggest a mix of 10% stores
and 26% loads for MIPS programs, making writes 10%/(100% + 26% + 10%) or about 7% of
the overall memory traffic. Of the data cache traffic, writes are 10%/(26% + 10%) or about 28%.
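The traffic fractions quoted above follow from treating every instruction fetch as one read alongside the 26% load / 10% store data mix; a quick check:

```python
# Per 100 instructions: 100 instruction-fetch reads, 26 loads, 10 stores
# (the MIPS mix cited from Appendix A).
fetches, loads, stores = 100, 26, 10

write_frac_total = stores / (fetches + loads + stores)  # share of all memory traffic
write_frac_data = stores / (loads + stores)             # share of data cache traffic

print(f"writes / overall traffic:    {write_frac_total:.1%}")  # about 7%
print(f"writes / data cache traffic: {write_frac_data:.1%}")   # about 28%
```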
Making the common case fast means optimizing caches for reads, especially since processors