Determine if way selection improves performance per watt based on the estimates from the study above.
Answer
For the I-cache, the savings in power is 0.25 × 0.28 = 0.07 of the total power, while
for the D-cache it is 0.15 × 0.35 = 0.05, for a total savings of 0.12. The way prediction
version requires 0.88 of the power requirement of the standard four-way
cache. The increase in cache access time is the increase in I-cache average access
time plus one-half the increase in D-cache access time, or 1.04 + 0.5 × 0.13 = 1.11
times longer. This result means that way selection has 0.90 of the performance
of a standard four-way cache. Thus, way selection improves performance per
joule very slightly by a ratio of 0.90/0.88 = 1.02. This optimization is best used
where power rather than performance is the key objective.
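The arithmetic in the answer can be checked with a few lines of Python. The inputs are taken directly from the text (the I-cache is 25% of total power with a 28% saving, the D-cache 15% with a 35% saving); the final ratio uses the book's rounded intermediate figures:

```python
# Reproducing the way-selection arithmetic from the example above.
i_power_share, i_savings = 0.25, 0.28   # I-cache: 25% of power, 28% saved
d_power_share, d_savings = 0.15, 0.35   # D-cache: 15% of power, 35% saved

power_saved = i_power_share * i_savings + d_power_share * d_savings
relative_power = 1 - power_saved        # ~0.88 of the standard cache's power

# Access time stretch: I-cache increase of 1.04 plus one-half
# of the 0.13 D-cache increase, as in the text.
relative_time = 1.04 + 0.5 * 0.13       # ~1.11 times longer
relative_perf = 1 / relative_time       # ~0.90 of baseline performance

# Ratio computed from the rounded figures, matching the book's 0.90/0.88.
perf_per_joule_ratio = round(relative_perf, 2) / round(relative_power, 2)
```

Carrying full precision instead of the rounded intermediates gives a slightly larger ratio, but the conclusion is the same: the improvement is marginal.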
Third Optimization: Pipelined Cache Access To Increase Cache
Bandwidth
This optimization is simply to pipeline cache access so that the effective latency of a first-level
cache hit can be multiple clock cycles, giving a fast clock cycle time and high bandwidth
but slow hits. For example, the pipeline for the instruction cache access for Intel Pentium
processors in the mid-1990s took 1 clock cycle, for the Pentium Pro through Pentium III in
the mid-1990s through 2000 it took 2 clocks, and for the Pentium 4, which became available
in 2000, and the current Intel Core i7 it takes 4 clocks. This change increases the number of
pipeline stages, leading to a greater penalty on mispredicted branches and more clock cycles
between issuing the load and using the data (see Chapter 3), but it does make it easier to incorporate high degrees of associativity.
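The cost of the deeper pipeline can be illustrated with a simple CPI model. The branch frequency, misprediction rate, and penalty cycles below are hypothetical numbers chosen for illustration, not figures from the text:

```python
# Hypothetical sketch: deeper cache pipelines lengthen the branch
# misprediction penalty, raising effective CPI. All numbers are assumptions.
def effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """CPI including stalls from branch mispredictions."""
    return base_cpi + branch_freq * mispredict_rate * penalty_cycles

# 1-cycle I-cache access (Pentium era) vs. 4-cycle access (Core i7 era):
# assume the three extra stages add three cycles to the misprediction penalty.
cpi_short = effective_cpi(1.0, 0.20, 0.05, 10)  # shallow pipeline
cpi_long  = effective_cpi(1.0, 0.20, 0.05, 13)  # three extra stages
```

Under these assumed parameters the deeper pipeline costs about 0.03 CPI, which the faster clock cycle must recover for the optimization to pay off.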
Fourth Optimization: Nonblocking Caches To Increase Cache
Bandwidth
For pipelined computers that allow out-of-order execution (discussed in Chapter 3), the processor need not stall on a data cache miss. For example, the processor could continue fetching
instructions from the instruction cache while waiting for the data cache to return the missing
data. A nonblocking cache or lockup-free cache escalates the potential benefits of such a scheme by allowing the data cache to continue to supply cache hits during a miss. This "hit under miss" optimization reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the processor. A subtle and complex option is that the cache may further lower the effective miss penalty if it can overlap multiple misses: a "hit under multiple miss" or "miss under miss" optimization. The second option is beneficial only if the memory system can service multiple misses; most high-performance processors, such as the Intel Core i7, usually support both, while lower-end processors, such as the ARM A8, provide only limited nonblocking support in L2.
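A toy model can make the "hit under miss" benefit concrete. The sketch below is an assumption-laden simplification (a single outstanding miss, one access per cycle, a 14-cycle penalty as in the Farkas and Jouppi study), not the book's simulation methodology:

```python
# Toy comparison of a blocking cache vs. a "hit under miss" nonblocking
# cache on a short access trace (True = miss). Parameters are assumptions.
MISS_PENALTY = 14  # cycles
HIT_TIME = 1       # cycles

def blocking_cycles(trace):
    # A blocking cache stalls for the full penalty on every miss.
    return sum(HIT_TIME + (MISS_PENALTY if miss else 0) for miss in trace)

def hit_under_miss_cycles(trace):
    # Hits proceed while one miss is outstanding, hiding part of the penalty.
    cycles = 0
    outstanding = 0  # cycles remaining on the current outstanding miss
    for miss in trace:
        if miss:
            cycles += outstanding        # wait out any earlier miss
            outstanding = MISS_PENALTY   # launch the new miss
            cycles += HIT_TIME
        else:
            cycles += HIT_TIME           # hit served under the miss
            outstanding = max(0, outstanding - HIT_TIME)
    return cycles + outstanding          # drain the final miss

trace = [True, False, False, False, True, False]
```

On this trace the blocking cache takes 34 cycles while the hit-under-miss cache takes 30, because the hits between the two misses overlap with the first miss's penalty.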
To examine the effectiveness of nonblocking caches in reducing the cache miss penalty, Farkas and Jouppi [1994] did a study assuming 8 KB caches with a 14-cycle miss penalty; they