ignoring cache misses: The original loop takes 7 clock cycles per iteration, the
first prefetch loop takes 9 clock cycles per iteration, and the second prefetch
loop takes 8 clock cycles per iteration (including the overhead of the outer for
loop). A miss takes 100 clock cycles.
Answer
The original doubly nested loop executes the multiply 3 × 100 or 300 times. Since
the loop takes 7 clock cycles per iteration, the total is 300 × 7 or 2100 clock cycles
plus cache misses. Cache misses add 251 × 100 or 25,100 clock cycles, giving a
total of 27,200 clock cycles. The first prefetch loop iterates 100 times; at 9 clock
cycles per iteration the total is 900 clock cycles plus cache misses. Adding 11 ×
100 or 1100 clock cycles for cache misses gives a total of 2000 clock cycles. The second loop
executes 2 × 100 or 200 times, and at 8 clock cycles per iteration it takes 1600
clock cycles plus 8 × 100 or 800 clock cycles for cache misses. This gives a total of
2400 clock cycles. From the prior example, we know that this code executes 400
prefetch instructions during the 2000 + 2400 or 4400 clock cycles to execute these
two loops. If we assume that the prefetches are completely overlapped with the
rest of the execution, then the prefetch code is 27,200/4400, or 6.2 times faster.
Although array optimizations are easy to understand, modern programs are more likely to
use pointers. Luk and Mowry [1999] have demonstrated that compiler-based prefetching can
sometimes be extended to pointers as well. Of 10 programs with recursive data structures,
prefetching all pointers when a node is visited improved performance by 4% to 31% in half
of the programs. On the other hand, the remaining programs were still within 2% of their
original performance. The issue is both whether prefetches are to data already in the cache and
whether they occur early enough for the data to arrive by the time it is needed.
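The scheme Luk and Mowry evaluate, prefetching all pointers when a node is visited, can be sketched as follows. This is an illustrative example only, not their implementation: the `node` type is hypothetical, and the sketch uses `__builtin_prefetch`, a GCC/Clang builtin that emits the target's prefetch instruction (or nothing) and never faults, even on a NULL address.

```c
#include <stddef.h>

/* Hypothetical recursive data structure; names are illustrative. */
typedef struct node {
    struct node *left, *right;
    long payload;
} node;

long sum_tree(const node *n) {
    if (n == NULL)
        return 0;
    /* As soon as a node is visited, issue prefetches for every pointer
       it holds, in the hope that the children's cache lines arrive by
       the time the recursive calls need them.  This illustrates the two
       hazards named in the text: the child may already be cached (the
       prefetch is wasted work), or the visit may be too short for the
       data to arrive in time. */
    __builtin_prefetch(n->left);
    __builtin_prefetch(n->right);
    return n->payload + sum_tree(n->left) + sum_tree(n->right);
}
```

Because the prefetch is only a hint, a wrong or NULL address costs at most a little bandwidth, which is why prefetching *all* pointers indiscriminately can still pay off on pointer-chasing code.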
Many processors support instructions for cache prefetch, and high-end processors (such as
the Intel Core i7) often also do some type of automated prefetch in hardware.
Cache Optimization Summary
The techniques to improve hit time, bandwidth, miss penalty, and miss rate generally affect
the other components of the average memory access equation as well as the complexity of
the memory hierarchy. Figure 2.11 summarizes these techniques and estimates the impact on
complexity, with + meaning that the technique improves the factor, - meaning it hurts that
factor, and blank meaning it has no impact. Generally, no technique helps more than one
category.