The total latency of the instruction miss that is serviced by main memory is approximately
35 processor cycles to determine that an L3 miss has occurred, plus the DRAM latency for the
critical instructions. For a single-bank DDR1600 SDRAM and 3.3 GHz CPU, the DRAM latency
is about 35 ns, or 100 clock cycles, to the first 16 bytes, leading to a total miss penalty of 135
clock cycles. The memory controller fills the remainder of the 64-byte cache block at a rate of
16 bytes per memory clock cycle, which takes another 15 ns or 45 clock cycles.
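The cycle accounting above can be checked with a few lines of arithmetic. The numbers below are the rounded figures quoted in the text, not independent measurements:

```python
# Rough miss-penalty accounting for an L3 instruction miss serviced by DRAM,
# using the rounded figures quoted in the text (illustrative, not measured).

L3_MISS_DETECT_CYCLES = 35    # cycles to determine that an L3 miss occurred
DRAM_FIRST_16B_CYCLES = 100   # ~35 ns of DRAM latency at 3.3 GHz, rounded
BLOCK_BYTES = 64
CRITICAL_CHUNK_BYTES = 16     # the first 16 bytes arrive with the critical word

miss_penalty = L3_MISS_DETECT_CYCLES + DRAM_FIRST_16B_CYCLES
print(miss_penalty)           # 135 cycles to the critical 16 bytes

# The remaining 48 bytes stream in at 16 bytes per memory clock,
# i.e. three more transfers to fill the 64-byte block.
remaining_transfers = (BLOCK_BYTES - CRITICAL_CHUNK_BYTES) // CRITICAL_CHUNK_BYTES
print(remaining_transfers)    # 3
```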
Since the second-level cache is a write-back cache, any miss can lead to an old block being
written back to memory. The i7 has a 10-entry merging write buffer that writes back dirty
cache lines when the next level in the cache is unused for a read. The write buffer is snooped
by any miss to see if the cache line exists in the buffer; if so, the miss is filled from the buffer.
A similar buffer is used between the L1 and L2 caches.
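The snooping behavior just described can be sketched in a few lines. This is an illustrative model of the policy, not the actual hardware design; the class and method names are invented for the example:

```python
# Minimal sketch of a merging write buffer that is snooped on a miss,
# in the spirit of the i7's 10-entry buffer described above.

class MergingWriteBuffer:
    def __init__(self, entries=10):
        self.entries = entries
        self.buffer = {}  # block address -> dirty block data

    def write_back(self, block_addr, data):
        """Queue a dirty block for write-back, merging with an existing entry."""
        if block_addr in self.buffer or len(self.buffer) < self.entries:
            self.buffer[block_addr] = data   # merging overwrites the older data
            return True
        return False                          # buffer full; must drain first

    def snoop(self, block_addr):
        """On a miss, check whether the needed block still sits in the buffer."""
        return self.buffer.get(block_addr)    # None means not present

wb = MergingWriteBuffer()
wb.write_back(0x1000, b"old dirty data")
assert wb.snoop(0x1000) == b"old dirty data"  # miss is filled from the buffer
assert wb.snoop(0x2000) is None               # must go to the next level
```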
If this initial instruction is a load, the data address is sent to the data cache and data TLBs,
acting very much like an instruction cache access with one key difference. The first-level data
cache is eight-way set associative, meaning that the index is 6 bits (versus 7 for the instruction
cache) and the address bits used to access the cache fall entirely within the page offset. Hence aliases in
the data cache are not a worry.
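The claim about aliasing follows from simple arithmetic on the cache geometry, assuming a 32 KB, eight-way cache with 64-byte blocks and 4 KB pages:

```python
# Why aliasing is not a concern in the L1 data cache: all address bits
# used to index it fall within the 4 KB page offset (illustrative arithmetic).

from math import log2

CACHE_BYTES = 32 * 1024
WAYS = 8
BLOCK_BYTES = 64
PAGE_BYTES = 4096

sets = CACHE_BYTES // (WAYS * BLOCK_BYTES)       # 64 sets
index_bits = int(log2(sets))                     # 6
offset_bits = int(log2(BLOCK_BYTES))             # 6
page_offset_bits = int(log2(PAGE_BYTES))         # 12

# Index + block offset fit inside the page offset, so the cache can be
# indexed in parallel with address translation without alias problems.
assert index_bits + offset_bits <= page_offset_bits
print(index_bits)  # 6 (versus 7 for the four-way instruction cache)
```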
Suppose the instruction is a store instead of a load. When the store issues, it does a data
cache lookup just like a load. A miss causes the block to be placed in a write buffer, since the
L1 cache does not allocate the block on a write miss. On a hit, the store does not update the
L1 (or L2) cache until later, after it is known to be nonspeculative. During this time the store
resides in a load-store queue, part of the out-of-order control mechanism of the processor.
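The store path just described can be sketched as follows. This is a simplified model of the policy, with invented names; the real load-store queue and retirement machinery are far more elaborate:

```python
# Sketch of the store path described above: a store miss does not allocate in
# L1 (the data goes to a write buffer), and a store hit is held in the
# load-store queue until it is known to be nonspeculative.

def handle_store(addr, data, l1_cache, write_buffer, load_store_queue):
    if addr in l1_cache:
        # Hit: do not update L1 yet; hold the store until it retires.
        load_store_queue.append((addr, data))
    else:
        # Miss: no-write-allocate, so the data goes to the write buffer.
        write_buffer.append((addr, data))

def retire_store(l1_cache, load_store_queue):
    """Once the oldest store is nonspeculative, it may update the cache."""
    addr, data = load_store_queue.pop(0)
    l1_cache[addr] = data

l1, wb, lsq = {0x40: b"x"}, [], []
handle_store(0x40, b"new", l1, wb, lsq)   # hit: parked in the LSQ
handle_store(0x80, b"miss", l1, wb, lsq)  # miss: sent to the write buffer
retire_store(l1, lsq)
assert l1[0x40] == b"new" and wb == [(0x80, b"miss")]
```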
The i7 also supports prefetching for L1 and L2 from the next level in the hierarchy. In most
cases, the prefetched line is simply the next block in the cache. By prefetching only for L1 and
L2, high-cost unnecessary fetches to memory are avoided.
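A next-block prefetcher in miniature: on an access to block b, also fetch block b+1 from the next level. This is a hedged sketch of the policy the text describes; the actual i7 prefetchers are considerably more sophisticated:

```python
# Toy model of next-block prefetching from the next level of the hierarchy.
# Function names and structure are illustrative only.

BLOCK_BYTES = 64

def access_with_prefetch(addr, cache, next_level_fetch):
    block = addr // BLOCK_BYTES
    for b in (block, block + 1):          # demand block, then the next block
        if b not in cache:
            cache[b] = next_level_fetch(b)
    return cache[block]

fetched = []
access_with_prefetch(0x100, {}, lambda b: fetched.append(b) or b)
assert fetched == [4, 5]                  # block 4 demanded, block 5 prefetched
```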
Performance of the i7 Memory System
We evaluate the performance of the i7 cache structure using 19 of the SPECCPU2006 bench-
marks (12 integer and 7 floating point), which were described in Chapter 1. The data in this
section were collected by Professor Lu Peng and Ph.D. student Ying Zhang, both of Louisiana
State University.
We begin with the L1 cache. The 32 KB, four-way set associative instruction cache leads to
a very low instruction miss rate, especially because the instruction prefetch in the i7 is quite
effective. Of course, how we evaluate the miss rate is a bit tricky, since the i7 does not generate
individual requests for single instruction units, but instead prefetches 16 bytes of instruction
data (between four and five instructions typically). If, for simplicity, we examine the instruc-
tion cache miss rate as if single instruction references were handled, then the L1 instruction
cache miss rate varies from 0.1% to 1.8%, averaging just over 0.4%. This rate is in keeping with
other studies of instruction cache behavior for the SPECCPU2006 benchmarks, which showed
low instruction cache miss rates.
The L1 data cache is more interesting and even trickier to evaluate for three reasons:
1. Because the L1 data cache is not write allocated, writes can hit but never really miss, in the
sense that a write that does not hit simply places its data in the write buffer and does not
record as a miss.
2. Because speculation may sometimes be wrong (see Chapter 3 for an extensive discussion),
there are references to the L1 data cache that do not correspond to loads or stores that even-
tually complete execution. How should such misses be treated?
3. Finally, the L1 data cache does automatic prefetching. Should prefetches that miss be counted, and, if so, how?
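The three complications above amount to choosing a counting policy. The sketch below classifies L1 data-cache accesses under one such policy; it is illustrative only, and the chapter's measurements make their own choices for each of the three questions:

```python
# One possible miss-counting policy for the L1 data cache, reflecting the
# three complications listed above (illustrative, not the policy actually
# used to collect the data in this section).

def count_as_miss(access, count_prefetches=True):
    """access: dict with keys 'hit', 'is_write', 'speculative', 'prefetch'."""
    if access["hit"]:
        return False
    if access["is_write"]:
        return False   # no-write-allocate: the write buffer absorbs it
    if access["speculative"]:
        return False   # squashed load: it never completes execution
    if access["prefetch"]:
        return count_prefetches   # question 3 is a policy choice
    return True

w = {"hit": False, "is_write": True, "speculative": False, "prefetch": False}
assert count_as_miss(w) is False           # a write "miss" is not recorded
p = {"hit": False, "is_write": False, "speculative": False, "prefetch": True}
assert count_as_miss(p, count_prefetches=False) is False
```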