};
Figure 2.9 illustrates the accesses to the three arrays using blocking. Looking only at capacity misses, the total number of memory words accessed is 2N³/B + N². This total is an improvement by about a factor of B. Hence, blocking exploits a combination of spatial and temporal locality, since y benefits from spatial locality and z benefits from temporal locality.
FIGURE 2.9 The age of accesses to the arrays x, y, and z when B = 3. Note that, in contrast to Figure 2.8, a smaller number of elements is accessed.
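The loop nest that produces this access pattern is short enough to show in full. The sketch below is a self-contained version of a blocked inner-product loop nest in the style discussed here; the values of N and B, the min helper, and the static arrays are assumptions made only so that the fragment compiles on its own.

#define N 512
#define B  64
static double x[N][N], y[N][N], z[N][N];

static int min(int a, int b) { return a < b ? a : b; }

void block_multiply(void)
{
    for (int jj = 0; jj < N; jj += B)
        for (int kk = 0; kk < N; kk += B)
            for (int i = 0; i < N; i++)
                for (int j = jj; j < min(jj + B, N); j++) {
                    double r = 0;
                    /* The B-by-B block of z stays resident in the cache
                       while it is reused across all N rows of y. */
                    for (int k = kk; k < min(kk + B, N); k++)
                        r += y[i][k] * z[k][j];
                    x[i][j] += r;
                }
}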
Although we have aimed at reducing cache misses, blocking can also be used to help register allocation. By taking a small blocking size such that the block can be held in registers, we can minimize the number of loads and stores in the program.
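To make the idea concrete, the fragment below sketches a 2-by-2 register block, reusing the arrays and N from the sketch above; the 2-by-2 shape, and the assumption that N is even, are choices made here for brevity. Each element of y and z that is loaded feeds two multiplies, so the number of loads per floating-point operation is roughly halved.

void register_block_multiply(void)
{
    for (int i = 0; i < N; i += 2)
        for (int j = 0; j < N; j += 2) {
            /* The 2x2 block of x is accumulated in scalars, which the
               compiler can keep in registers for the whole k loop. */
            double r00 = 0, r01 = 0, r10 = 0, r11 = 0;
            for (int k = 0; k < N; k++) {
                r00 += y[i][k]     * z[k][j];
                r01 += y[i][k]     * z[k][j + 1];
                r10 += y[i + 1][k] * z[k][j];
                r11 += y[i + 1][k] * z[k][j + 1];
            }
            x[i][j]         += r00;
            x[i][j + 1]     += r01;
            x[i + 1][j]     += r10;
            x[i + 1][j + 1] += r11;
        }
}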
As we shall see in Section 4.8 of Chapter 4, cache blocking is absolutely necessary to get
good performance from cache-based processors running applications using matrices as the
primary data structure.
Ninth Optimization: Hardware Prefetching of Instructions and Data to Reduce Miss Penalty or Miss Rate
Nonblocking caches effectively reduce the miss penalty by overlapping execution with
memory access. Another approach is to prefetch items before the processor requests them.
Both instructions and data can be prefetched, either directly into the caches or into an external
buffer that can be more quickly accessed than main memory.
Instruction prefetch is frequently done in hardware outside of the cache. Typically, the processor fetches two blocks on a miss: the requested block and the next consecutive block. The
requested block is placed in the instruction cache when it returns, and the prefetched block is
placed into the instruction stream buffer. If the requested block is present in the instruction
stream buffer, the original cache request is canceled, the block is read from the stream buffer,
and the next prefetch request is issued.
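The policy just described can be captured in a few lines. The sketch below models a one-entry instruction stream buffer in front of a toy, tag-only direct-mapped cache; the structure names, the cache size, and the helper routines are all assumptions made for illustration, not the hardware of any particular processor.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define CACHE_BLOCKS 64   /* toy direct-mapped instruction cache, tags only */

static uint64_t cache_tag[CACHE_BLOCKS];
static bool     cache_valid[CACHE_BLOCKS];

typedef struct {
    uint64_t tag;     /* block address held in the one-entry stream buffer */
    bool     valid;
} stream_buffer_t;

static bool cache_lookup(uint64_t block)
{
    unsigned idx = block % CACHE_BLOCKS;
    return cache_valid[idx] && cache_tag[idx] == block;
}

static void cache_fill(uint64_t block)
{
    unsigned idx = block % CACHE_BLOCKS;
    cache_tag[idx] = block;
    cache_valid[idx] = true;
}

/* Stand-in for issuing a block fetch to the next level of memory. */
static void memory_fetch(uint64_t block)
{
    printf("fetching block %llu from memory\n", (unsigned long long)block);
}

void access_block(stream_buffer_t *sb, uint64_t block)
{
    if (cache_lookup(block))
        return;                          /* ordinary cache hit */

    if (sb->valid && sb->tag == block) {
        /* Hit in the stream buffer: the normal miss is canceled and the
           block is moved from the buffer into the cache. */
        cache_fill(block);
    } else {
        /* Miss everywhere: fetch the requested block into the cache. */
        memory_fetch(block);
        cache_fill(block);
    }
    /* In either case, (re)load the stream buffer by prefetching the next
       consecutive block. */
    sb->tag = block + 1;
    memory_fetch(sb->tag);
    sb->valid = true;
}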
A similar approach can be applied to data accesses [ Jouppi 1990 ] . Palacharla and Kessler
[1994] looked at a set of scientific programs and considered multiple stream buffers that could
handle either instructions or data. They found that eight stream buffers could capture 50% to
70% of all misses from a processor with two 64 KB four-way set associative caches, one for instructions and the other for data.