Either flavor of prefetch (register prefetch, which loads the value into a register, or cache prefetch, which loads the data only into the cache) can be faulting or nonfaulting; that is, the address either does or does not cause an exception for virtual address faults and protection violations. Using this terminology, a normal load instruction could be considered a "faulting register prefetch instruction." Nonfaulting prefetches simply turn into no-ops if they would normally result in an exception, which is exactly what we want.
The most effective prefetch is “semantically invisible” to a program: It doesn't change the
contents of registers and memory, and it cannot cause virtual memory faults. Most processors
today offer nonfaulting cache prefetches. This section assumes nonfaulting cache prefetch,
also called nonbinding prefetch.
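One concrete form this takes in practice is the __builtin_prefetch intrinsic of GCC and Clang, sketched below; the function name warm and the pointer argument are illustrative, not from the original text.

/* A minimal sketch of a nonfaulting, nonbinding cache prefetch using
   the GCC/Clang __builtin_prefetch intrinsic. The second argument
   hints read (0) or write (1); the third is a temporal-locality hint
   from 0 (no reuse) to 3 (keep in all cache levels). The prefetch
   changes no register or memory state, and if p refers to an unmapped
   address it is silently dropped rather than faulting. */
static inline void warm(const double *p) {
    __builtin_prefetch(p, /* rw = read */ 0, /* locality = high */ 3);
}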
Prefetching makes sense only if the processor can proceed while prefetching the data; that
is, the caches do not stall but continue to supply instructions and data while waiting for the
prefetched data to return. As you would expect, the data cache for such computers is normally
nonblocking.
As with hardware-controlled prefetching, the goal is to overlap execution with the prefetching of data. Loops are the important targets because they lend themselves to prefetch optimizations. If the miss penalty is small, the compiler simply unrolls the loop once or twice and schedules the prefetches alongside the execution. If the miss penalty is large, it uses software pipelining (see Appendix H) or unrolls the loop many times to prefetch data for a future iteration.
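A minimal sketch of what such compiler-generated code might look like appears below; the function name, the array x, and the distance DIST are illustrative assumptions, and the right distance depends on the miss penalty and the work per iteration.

#include <stddef.h>

/* Illustrative prefetch distance: chosen so that, roughly,
   miss_penalty <= DIST * cycles_per_iteration. */
#define DIST 8

double sum_with_prefetch(const double *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i = i + 1) {
        /* Fetch x[i + DIST] early; because the prefetch is
           nonfaulting, running DIST elements past the end of x
           is harmless at the hardware level. */
        __builtin_prefetch(&x[i + DIST], 0, 1);
        sum = sum + x[i];
    }
    return sum;
}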
Issuing prefetch instructions incurs an instruction overhead, however, so compilers must take care to ensure that such overheads do not exceed the benefits. By concentrating on references that are likely to be cache misses, programs can avoid unnecessary prefetches while improving average memory access time significantly.
Example
For the code below, determine which accesses are likely to cause data cache misses. Next, insert prefetch instructions to reduce misses. Finally, calculate the number of prefetch instructions executed and the misses avoided by prefetching. Let's assume we have an 8 KB direct-mapped data cache with 16-byte blocks, and it is a write-back cache that does write allocate. The elements of a and b are 8 bytes long since they are double-precision floating-point arrays. There are 3 rows and 100 columns for a and 101 rows and 3 columns for b. Let's also assume they are not in the cache at the start of the program.
for (i = 0; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1)
        a[i][j] = b[j][0] * b[j+1][0];
Answer
The compiler will first determine which accesses are likely to cause cache misses; otherwise, we will waste time issuing prefetch instructions for data that would be hits. Elements of a are written in the order that they are stored in memory, so a will benefit from spatial locality: The even values of j will miss and the odd values will hit. Since a has 3 rows and 100 columns, its accesses will lead to 3 × (100/2), or 150 misses.
The array b does not benefit from spatial locality since the accesses are not in the order it is stored. The array b does benefit twice from temporal locality: The same elements are accessed for each iteration of i, and each iteration of j uses the same value of b as the previous iteration (the b[j+1][0] read in one iteration is the b[j][0] of the next).
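To make the transformation concrete, one plausible rewrite is sketched below; prefetch() stands for a nonfaulting cache prefetch instruction, and the distance of seven iterations is an assumption chosen so the prefetched data arrives before it is needed. Splitting off the first i iteration lets the loop prefetch b only once, since b stays in the cache afterward.

/* First pass (i = 0): prefetch both a and b seven iterations ahead. */
for (j = 0; j < 100; j = j+1) {
    prefetch(b[j+7][0]);     /* b(j,0) for 7 iterations later */
    prefetch(a[0][j+7]);     /* a(0,j) for 7 iterations later */
    a[0][j] = b[j][0] * b[j+1][0];
}
/* Remaining passes: b is already cached, so prefetch only a. */
for (i = 1; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1) {
        prefetch(a[i][j+7]); /* a(i,j) for 7 iterations later */
        a[i][j] = b[j][0] * b[j+1][0];
    }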