The total latency of the instruction miss that is serviced by main memory is approximately
35 processor cycles to determine that an L3 miss has occurred, plus the DRAM latency for the
critical instructions. For a single-bank DDR1600 SDRAM and 3.3 GHz CPU, the DRAM latency
is about 35 ns, or 100 clock cycles, to the first 16 bytes, leading to a total miss penalty of 135
clock cycles. The memory controller fills the remainder of the 64-byte cache block at a rate of
16 bytes per memory clock cycle, which takes another 15 ns or 45 clock cycles.
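The cycle accounting above can be checked with a few lines of arithmetic. The numbers below are the rounded figures quoted in the text, not independent measurements:

```python
# Rough miss-penalty accounting for an L3 instruction miss serviced by DRAM,
# using the rounded figures quoted in the text (illustrative, not measured).

L3_MISS_DETECT_CYCLES = 35    # cycles to determine that an L3 miss occurred
DRAM_FIRST_16B_CYCLES = 100   # ~35 ns of DRAM latency at 3.3 GHz, rounded
BLOCK_BYTES = 64
CRITICAL_CHUNK_BYTES = 16     # the first 16 bytes arrive with the critical word

miss_penalty = L3_MISS_DETECT_CYCLES + DRAM_FIRST_16B_CYCLES
print(miss_penalty)           # 135 cycles to the critical 16 bytes

# The remaining 48 bytes stream in at 16 bytes per memory clock,
# i.e. three more transfers to fill the 64-byte block.
remaining_transfers = (BLOCK_BYTES - CRITICAL_CHUNK_BYTES) // CRITICAL_CHUNK_BYTES
print(remaining_transfers)    # 3
```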
Since the second-level cache is a write-back cache, any miss can lead to an old block being
written back to memory. The i7 has a 10-entry merging write buffer that writes back dirty
cache lines when the next level in the cache is unused for a read. The write buffer is snooped
by any miss to see if the cache line exists in the buffer; if so, the miss is filled from the buffer.
A similar buffer is used between the L1 and L2 caches.
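The snooping behavior just described can be sketched in a few lines. This is an illustrative model of the policy, not the actual hardware design; the class and method names are invented for the example:

```python
# Minimal sketch of a merging write buffer that is snooped on a miss,
# in the spirit of the i7's 10-entry buffer described above.

class MergingWriteBuffer:
    def __init__(self, entries=10):
        self.entries = entries
        self.buffer = {}  # block address -> dirty block data

    def write_back(self, block_addr, data):
        """Queue a dirty block for write-back, merging with an existing entry."""
        if block_addr in self.buffer or len(self.buffer) < self.entries:
            self.buffer[block_addr] = data   # merging overwrites the older data
            return True
        return False                          # buffer full; must drain first

    def snoop(self, block_addr):
        """On a miss, check whether the needed block still sits in the buffer."""
        return self.buffer.get(block_addr)    # None means not present

wb = MergingWriteBuffer()
wb.write_back(0x1000, b"old dirty data")
assert wb.snoop(0x1000) == b"old dirty data"  # miss is filled from the buffer
assert wb.snoop(0x2000) is None               # must go to the next level
```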
If this initial instruction is a load, the data address is sent to the data cache and data TLBs,
acting very much like an instruction cache access with one key difference. The first-level data
cache is eight-way set associative, meaning that the index is 6 bits (versus 7 for the instruction
cache) and the address bits used to access the cache fall entirely within the page offset. Hence aliases in
the data cache are not a worry.
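The claim about aliasing follows from simple arithmetic on the cache geometry, assuming a 32 KB, eight-way cache with 64-byte blocks and 4 KB pages:

```python
# Why aliasing is not a concern in the L1 data cache: all address bits
# used to index it fall within the 4 KB page offset (illustrative arithmetic).

from math import log2

CACHE_BYTES = 32 * 1024
WAYS = 8
BLOCK_BYTES = 64
PAGE_BYTES = 4096

sets = CACHE_BYTES // (WAYS * BLOCK_BYTES)       # 64 sets
index_bits = int(log2(sets))                     # 6
offset_bits = int(log2(BLOCK_BYTES))             # 6
page_offset_bits = int(log2(PAGE_BYTES))         # 12

# Index + block offset fit inside the page offset, so the cache can be
# indexed in parallel with address translation without alias problems.
assert index_bits + offset_bits <= page_offset_bits
print(index_bits)  # 6 (versus 7 for the four-way instruction cache)
```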
Suppose the instruction is a store instead of a load. When the store issues, it does a data
cache lookup just like a load. A miss causes the block to be placed in a write buffer, since the
L1 cache does not allocate the block on a write miss. On a hit, the store does not update the
L1 (or L2) cache until later, after it is known to be nonspeculative. During this time the store
resides in a load-store queue, part of the out-of-order control mechanism of the processor.
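The store path just described can be sketched as follows. This is a simplified model of the policy, with invented names; the real load-store queue and retirement machinery are far more elaborate:

```python
# Sketch of the store path described above: a store miss does not allocate in
# L1 (the data goes to a write buffer), and a store hit is held in the
# load-store queue until it is known to be nonspeculative.

def handle_store(addr, data, l1_cache, write_buffer, load_store_queue):
    if addr in l1_cache:
        # Hit: do not update L1 yet; hold the store until it retires.
        load_store_queue.append((addr, data))
    else:
        # Miss: no-write-allocate, so the data goes to the write buffer.
        write_buffer.append((addr, data))

def retire_store(l1_cache, load_store_queue):
    """Once the oldest store is nonspeculative, it may update the cache."""
    addr, data = load_store_queue.pop(0)
    l1_cache[addr] = data

l1, wb, lsq = {0x40: b"x"}, [], []
handle_store(0x40, b"new", l1, wb, lsq)   # hit: parked in the LSQ
handle_store(0x80, b"miss", l1, wb, lsq)  # miss: sent to the write buffer
retire_store(l1, lsq)
assert l1[0x40] == b"new" and wb == [(0x80, b"miss")]
```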
The i7 also supports prefetching for L1 and L2 from the next level in the hierarchy. In most
cases, the prefetched line is simply the next block in the cache. By prefetching only for L1 and
L2, high-cost unnecessary fetches to memory are avoided.
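A next-block prefetcher in miniature: on an access to block b, also fetch block b+1 from the next level. This is a hedged sketch of the policy the text describes; the actual i7 prefetchers are considerably more sophisticated:

```python
# Toy model of next-block prefetching from the next level of the hierarchy.
# Function names and structure are illustrative only.

BLOCK_BYTES = 64

def access_with_prefetch(addr, cache, next_level_fetch):
    block = addr // BLOCK_BYTES
    for b in (block, block + 1):          # demand block, then the next block
        if b not in cache:
            cache[b] = next_level_fetch(b)
    return cache[block]

fetched = []
access_with_prefetch(0x100, {}, lambda b: fetched.append(b) or b)
assert fetched == [4, 5]                  # block 4 demanded, block 5 prefetched
```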
Performance of the i7 Memory System
We evaluate the performance of the i7 cache structure using 19 of the SPECCPU2006 bench-
marks (12 integer and 7 floating point), which were described in Chapter 1. The data in this
section were collected by Professor Lu Peng and Ph.D. student Ying Zhang, both of Louisiana
State University.
We begin with the L1 cache. The 32 KB, four-way set associative instruction cache leads to
a very low instruction miss rate, especially because the instruction prefetch in the i7 is quite
effective. Of course, how we evaluate the miss rate is a bit tricky, since the i7 does not generate
individual requests for single instruction units, but instead prefetches 16 bytes of instruction
data (between four and five instructions typically). If, for simplicity, we examine the instruc-
tion cache miss rate as if single instruction references were handled, then the L1 instruction
cache miss rate varies from 0.1% to 1.8%, averaging just over 0.4%. This rate is in keeping with
other studies of instruction cache behavior for the SPECCPU2006 benchmarks, which showed
low instruction cache miss rates.
The L1 data cache is more interesting and even trickier to evaluate for three reasons:
1. Because the L1 data cache is not write allocated, writes can hit but never really miss, in the
sense that a write that does not hit simply places its data in the write buffer and does not
record as a miss.
2. Because speculation may sometimes be wrong (see Chapter 3 for an extensive discussion),
there are references to the L1 data cache that do not correspond to loads or stores that even-
tually complete execution. How should such misses be treated?
3. Finally, the L1 data cache does automatic prefetching. Should prefetches that miss be counted, and, if so, how?
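The three complications above amount to choosing a counting policy. The sketch below classifies L1 data-cache accesses under one such policy; it is illustrative only, and the chapter's measurements make their own choices for each of the three questions:

```python
# One possible miss-counting policy for the L1 data cache, reflecting the
# three complications listed above (illustrative, not the policy actually
# used to collect the data in this section).

def count_as_miss(access, count_prefetches=True):
    """access: dict with keys 'hit', 'is_write', 'speculative', 'prefetch'."""
    if access["hit"]:
        return False
    if access["is_write"]:
        return False   # no-write-allocate: the write buffer absorbs it
    if access["speculative"]:
        return False   # squashed load: it never completes execution
    if access["prefetch"]:
        return count_prefetches   # question 3 is a policy choice
    return True

w = {"hit": False, "is_write": True, "speculative": False, "prefetch": False}
assert count_as_miss(w) is False           # a write "miss" is not recorded
p = {"hit": False, "is_write": False, "speculative": False, "prefetch": True}
assert count_as_miss(p, count_prefetches=False) is False
```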