ignoring cache misses: The original loop takes 7 clock cycles per iteration, the
first prefetch loop takes 9 clock cycles per iteration, and the second prefetch
loop takes 8 clock cycles per iteration (including the overhead of the outer for
loop). A miss takes 100 clock cycles.
Answer
The original doubly nested loop executes the multiply 3 × 100 or 300 times. Since
the loop takes 7 clock cycles per iteration, the total is 300 × 7 or 2100 clock cycles
plus cache misses. Cache misses add 251 × 100 or 25,100 clock cycles, giving a
total of 27,200 clock cycles. The first prefetch loop iterates 100 times; at 9 clock
cycles per iteration the total is 900 clock cycles plus cache misses. Adding 11 ×
100 or 1100 clock cycles for cache misses gives a total of 2000 clock cycles. The second loop
executes 2 × 100 or 200 times, and at 8 clock cycles per iteration it takes 1600
clock cycles plus 8 × 100 or 800 clock cycles for cache misses. This gives a total of
2400 clock cycles. From the prior example, we know that this code executes 400
prefetch instructions during the 2000 + 2400 or 4400 clock cycles to execute these
two loops. If we assume that the prefetches are completely overlapped with the
rest of the execution, then the prefetch code is 27,200/4400, or 6.2 times faster.
Although array optimizations are easy to understand, modern programs are more likely to
use pointers. Luk and Mowry [1999] have demonstrated that compiler-based prefetching can
sometimes be extended to pointers as well. Of 10 programs with recursive data structures,
prefetching all pointers when a node is visited improved performance by 4% to 31% in half
of the programs. On the other hand, the remaining programs were still within 2% of their
original performance. The issue is both whether prefetches are to data already in the cache and
whether they occur early enough for the data to arrive by the time it is needed.
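The scheme Luk and Mowry evaluate, prefetching all pointers when a node is visited, can be sketched as follows. This is an illustrative example only, not their implementation: the `node` type is hypothetical, and the sketch uses `__builtin_prefetch`, a GCC/Clang builtin that emits the target's prefetch instruction (or nothing) and never faults, even on a NULL address.

```c
#include <stddef.h>

/* Hypothetical recursive data structure; names are illustrative. */
typedef struct node {
    struct node *left, *right;
    long payload;
} node;

long sum_tree(const node *n) {
    if (n == NULL)
        return 0;
    /* As soon as a node is visited, issue prefetches for every pointer
       it holds, in the hope that the children's cache lines arrive by
       the time the recursive calls need them.  This illustrates the two
       hazards named in the text: the child may already be cached (the
       prefetch is wasted work), or the visit may be too short for the
       data to arrive in time. */
    __builtin_prefetch(n->left);
    __builtin_prefetch(n->right);
    return n->payload + sum_tree(n->left) + sum_tree(n->right);
}
```

Because the prefetch is only a hint, a wrong or NULL address costs at most a little bandwidth, which is why prefetching *all* pointers indiscriminately can still pay off on pointer-chasing code.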
Many processors support instructions for cache prefetch, and high-end processors (such as
the Intel Core i7) often also do some type of automated prefetch in hardware.
Cache Optimization Summary
The techniques to improve hit time, bandwidth, miss penalty, and miss rate generally affect
the other components of the average memory access equation as well as the complexity of
the memory hierarchy. Figure 2.11 summarizes these techniques and estimates the impact on
complexity, with + meaning that the technique improves the factor, - meaning it hurts that
factor, and blank meaning it has no impact. Generally, no technique helps more than one
category.