Determine if way selection improves performance per watt based on the estimates from the study above.
Answer
For the I-cache, the savings in power is 0.25 × 0.28 = 0.07 of the total power, while
for the D-cache it is 0.15 × 0.35 = 0.05, for a total savings of 0.12. The way prediction
version requires 0.88 of the power requirement of the standard four-way
cache. The increase in cache access time is the increase in I-cache average access
time plus one-half the increase in D-cache access time, or 1.04 + 0.5 × 0.13 = 1.11
times longer. This result means that way selection has 0.90 of the performance
of a standard four-way cache. Thus, way selection improves performance per
joule very slightly by a ratio of 0.90/0.88 = 1.02. This optimization is best used
where power rather than performance is the key objective.
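The arithmetic in the answer can be checked with a few lines of Python. The inputs are taken directly from the text (the I-cache is 25% of total power with a 28% saving, the D-cache 15% with a 35% saving); the final ratio uses the book's rounded intermediate figures:

```python
# Reproducing the way-selection arithmetic from the example above.
i_power_share, i_savings = 0.25, 0.28   # I-cache: 25% of power, 28% saved
d_power_share, d_savings = 0.15, 0.35   # D-cache: 15% of power, 35% saved

power_saved = i_power_share * i_savings + d_power_share * d_savings
relative_power = 1 - power_saved        # ~0.88 of the standard cache's power

# Access time stretch: I-cache increase of 1.04 plus one-half
# of the 0.13 D-cache increase, as in the text.
relative_time = 1.04 + 0.5 * 0.13       # ~1.11 times longer
relative_perf = 1 / relative_time       # ~0.90 of baseline performance

# Ratio computed from the rounded figures, matching the book's 0.90/0.88.
perf_per_joule_ratio = round(relative_perf, 2) / round(relative_power, 2)
```

Carrying full precision instead of the rounded intermediates gives a slightly larger ratio, but the conclusion is the same: the improvement is marginal.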
Third Optimization: Pipelined Cache Access To Increase Cache
Bandwidth
This optimization is simply to pipeline cache access so that the effective latency of a first-level
cache hit can be multiple clock cycles, giving a fast clock cycle time and high bandwidth
but slow hits. For example, the pipeline for the instruction cache access for Intel Pentium
processors in the mid-1990s took 1 clock cycle, for the Pentium Pro through Pentium III in
the mid-1990s through 2000 it took 2 clocks, and for the Pentium 4, which became available
in 2000, and the current Intel Core i7 it takes 4 clocks. This change increases the number of
pipeline stages, leading to a greater penalty on mispredicted branches and more clock cycles
between issuing the load and using the data (see Chapter 3), but it does make it easier to incorporate high degrees of associativity.
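The cost of the deeper pipeline can be illustrated with a simple CPI model. The branch frequency, misprediction rate, and penalty cycles below are hypothetical numbers chosen for illustration, not figures from the text:

```python
# Hypothetical sketch: deeper cache pipelines lengthen the branch
# misprediction penalty, raising effective CPI. All numbers are assumptions.
def effective_cpi(base_cpi, branch_freq, mispredict_rate, penalty_cycles):
    """CPI including stalls from branch mispredictions."""
    return base_cpi + branch_freq * mispredict_rate * penalty_cycles

# 1-cycle I-cache access (Pentium era) vs. 4-cycle access (Core i7 era):
# assume the three extra stages add three cycles to the misprediction penalty.
cpi_short = effective_cpi(1.0, 0.20, 0.05, 10)  # shallow pipeline
cpi_long  = effective_cpi(1.0, 0.20, 0.05, 13)  # three extra stages
```

Under these assumed parameters the deeper pipeline costs about 0.03 CPI, which the faster clock cycle must recover for the optimization to pay off.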
Fourth Optimization: Nonblocking Caches To Increase Cache
Bandwidth
For pipelined computers that allow out-of-order execution (discussed in Chapter 3), the processor need not stall on a data cache miss. For example, the processor could continue fetching
instructions from the instruction cache while waiting for the data cache to return the missing
data. A nonblocking cache or lockup-free cache escalates the potential benefits of such a scheme by allowing the data cache to continue to supply cache hits during a miss. This "hit under miss" optimization reduces the effective miss penalty by being helpful during a miss instead of ignoring the requests of the processor. A subtle and complex option is that the cache may further lower the effective miss penalty if it can overlap multiple misses: a "hit under multiple miss" or "miss under miss" optimization. The second option is beneficial only if the memory system can service multiple misses; most high-performance processors, such as the Intel Core i7, usually support both, while lower-end processors, such as the ARM A8, provide only limited nonblocking support in L2.
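A toy model can make the "hit under miss" benefit concrete. The sketch below is an assumption-laden simplification (a single outstanding miss, one access per cycle, a 14-cycle penalty as in the Farkas and Jouppi study), not the book's simulation methodology:

```python
# Toy comparison of a blocking cache vs. a "hit under miss" nonblocking
# cache on a short access trace (True = miss). Parameters are assumptions.
MISS_PENALTY = 14  # cycles
HIT_TIME = 1       # cycles

def blocking_cycles(trace):
    # A blocking cache stalls for the full penalty on every miss.
    return sum(HIT_TIME + (MISS_PENALTY if miss else 0) for miss in trace)

def hit_under_miss_cycles(trace):
    # Hits proceed while one miss is outstanding, hiding part of the penalty.
    cycles = 0
    outstanding = 0  # cycles remaining on the current outstanding miss
    for miss in trace:
        if miss:
            cycles += outstanding        # wait out any earlier miss
            outstanding = MISS_PENALTY   # launch the new miss
            cycles += HIT_TIME
        else:
            cycles += HIT_TIME           # hit served under the miss
            outstanding = max(0, outstanding - HIT_TIME)
    return cycles + outstanding          # drain the final miss

trace = [True, False, False, False, True, False]
```

On this trace the blocking cache takes 34 cycles while the hit-under-miss cache takes 30, because the hits between the two misses overlap with the first miss's penalty.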
To examine the effectiveness of nonblocking caches in reducing the cache miss penalty, Farkas and Jouppi [1994] did a study assuming 8 KB caches with a 14-cycle miss penalty; they