Either flavor of prefetch (register prefetch, which loads the value into a register, or cache prefetch, which loads the data only into the cache) can be faulting or nonfaulting; that is, the address either does or does not cause an exception for virtual address faults and protection violations. Using this terminology, a normal load instruction could be considered a "faulting register prefetch instruction." Nonfaulting prefetches simply turn into no-ops if they would normally result in an exception, which is exactly what we want.
The most effective prefetch is “semantically invisible” to a program: It doesn't change the
contents of registers and memory, and it cannot cause virtual memory faults. Most processors
today offer nonfaulting cache prefetches. This section assumes nonfaulting cache prefetch,
also called nonbinding prefetch.
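One concrete form this takes in practice is the __builtin_prefetch intrinsic of GCC and Clang, sketched below; the function name warm and the pointer argument are illustrative, not from the original text.

/* A minimal sketch of a nonfaulting, nonbinding cache prefetch using
   the GCC/Clang __builtin_prefetch intrinsic. The second argument
   hints read (0) or write (1); the third is a temporal-locality hint
   from 0 (no reuse) to 3 (keep in all cache levels). The prefetch
   changes no register or memory state, and if p refers to an unmapped
   address it is silently dropped rather than faulting. */
static inline void warm(const double *p) {
    __builtin_prefetch(p, /* rw = read */ 0, /* locality = high */ 3);
}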
Prefetching makes sense only if the processor can proceed while prefetching the data; that
is, the caches do not stall but continue to supply instructions and data while waiting for the
prefetched data to return. As you would expect, the data cache for such computers is normally
nonblocking.
As with hardware-controlled prefetching, the goal is to overlap execution with the prefetching of data. Loops are the important targets because they lend themselves to prefetch optimizations. If the miss penalty is small, the compiler simply unrolls the loop once or twice and schedules the prefetches alongside the execution. If the miss penalty is large, it uses software pipelining (see Appendix H) or unrolls the loop many times to prefetch data for a future iteration.
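A minimal sketch of what such compiler-generated code might look like appears below; the function name, the array x, and the distance DIST are illustrative assumptions, and the right distance depends on the miss penalty and the work per iteration.

#include <stddef.h>

/* Illustrative prefetch distance: chosen so that, roughly,
   miss_penalty <= DIST * cycles_per_iteration. */
#define DIST 8

double sum_with_prefetch(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i = i + 1) {
        /* Fetch x[i + DIST] early; because the prefetch is
           nonfaulting, running DIST elements past the end of x
           is harmless at the hardware level. */
        __builtin_prefetch(&x[i + DIST], 0, 1);
        sum = sum + x[i];
    }
    return sum;
}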
Issuing prefetch instructions incurs an instruction overhead, however, so compilers must take care to ensure that such overheads do not exceed the benefits. By concentrating on references that are likely to be cache misses, programs can avoid unnecessary prefetches while improving average memory access time significantly.
Example
For the code below, determine which accesses are likely to cause data cache misses. Next, insert prefetch instructions to reduce misses. Finally, calculate the number of prefetch instructions executed and the misses avoided by prefetching. Let's assume we have an 8 KB direct-mapped data cache with 16-byte blocks, and it is a write-back cache that does write allocate. The elements of a and b are 8 bytes long since they are double-precision floating-point arrays. There are 3 rows and 100 columns for a and 101 rows and 3 columns for b. Let's also assume they are not in the cache at the start of the program.
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1)
        a[i][j] = b[j][0] * b[j+1][0];
Answer
The compiler will first determine which accesses are likely to cause cache misses; otherwise, we will waste time issuing prefetch instructions for data that would be hits. Elements of a are written in the order that they are stored in memory, so a will benefit from spatial locality: The even values of j will miss and the odd values will hit. Since a has 3 rows and 100 columns, its accesses will lead to 3 × (100/2), or 150 misses.
The array b does not benefit from spatial locality since the accesses are not in the order it is stored. The array b does benefit twice from temporal locality: The same elements are accessed for each iteration of i, and each iteration of j uses the same value of b as the previous iteration (the b[j+1][0] read in one iteration is the b[j][0] of the next).
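To make the transformation concrete, one plausible rewrite is sketched below; prefetch() stands for a nonfaulting cache prefetch instruction, and the distance of seven iterations is an assumption chosen so the prefetched data arrives before it is needed. Splitting off the first i iteration lets the loop prefetch b only once, since b stays in the cache afterward.

/* First pass (i = 0): prefetch both a and b seven iterations ahead. */
for (j = 0; j < 100; j = j+1) {
    prefetch(b[j+7][0]);     /* b(j,0) for 7 iterations later */
    prefetch(a[0][j+7]);     /* a(0,j) for 7 iterations later */
    a[0][j] = b[j][0] * b[j+1][0];
}
/* Remaining passes: b is already cached, so prefetch only a. */
for (i = 1; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1) {
        prefetch(a[i][j+7]); /* a(i,j) for 7 iterations later */
        a[i][j] = b[j][0] * b[j+1][0];
    }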