typically implemented by marking the pages as requiring nonmerging write through by the
caches.
Eighth Optimization: Compiler Optimizations To Reduce Miss Rate
Thus far, our techniques have required changing the hardware. This next technique reduces
miss rates without any hardware changes.
This magical reduction comes from optimized software—the hardware designer's favorite
solution! The increasing performance gap between processors and main memory has inspired
compiler writers to scrutinize the memory hierarchy to see if compile-time optimizations
can improve performance. Once again, research is split between improvements in instruction
misses and improvements in data misses. The optimizations presented below are found in
many modern compilers.
Loop Interchange
Some programs have nested loops that access data in memory in nonsequential order. Simply
exchanging the nesting of the loops can make the code access the data in the order in which
they are stored. Assuming the arrays do not fit in the cache, this technique reduces misses
by improving spatial locality; reordering maximizes use of data in a cache block before they
are discarded. For example, if x is a two-dimensional array of size [5000,100] allocated so that
x[i,j] and x[i,j+1] are adjacent (an order called row major, since the array is laid out by rows),
then the two pieces of code below show how the accesses can be optimized:
/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];
The original code would skip through memory in strides of 100 words, while the revised
version accesses all the words in one cache block before going to the next block. This optimization improves cache performance without affecting the number of instructions executed.
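The interchange changes only the traversal order, never the arithmetic, so both versions must produce identical results. A minimal compilable check (with the array shrunk from the text's 5000×100 purely for brevity; the function names are illustrative):

```c
#include <assert.h>

#define ROWS 50
#define COLS 10

/* Before: column-order traversal. Successive accesses are COLS elements
   apart in memory, so each touch may land in a different cache block. */
static void double_col_order(double x[ROWS][COLS]) {
    for (int j = 0; j < COLS; j = j + 1)
        for (int i = 0; i < ROWS; i = i + 1)
            x[i][j] = 2 * x[i][j];
}

/* After: row-order traversal. Successive accesses are adjacent, so every
   word of a cache block is used before the block is discarded. */
static void double_row_order(double x[ROWS][COLS]) {
    for (int i = 0; i < ROWS; i = i + 1)
        for (int j = 0; j < COLS; j = j + 1)
            x[i][j] = 2 * x[i][j];
}
```

Both functions execute the same number of multiplications; only the memory access pattern, and hence the miss rate on arrays larger than the cache, differs.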
Blocking
This optimization improves temporal locality to reduce misses. We are again dealing with
multiple arrays, with some arrays accessed by rows and some by columns. Storing the arrays row by row (row major order) or column by column (column major order) does not solve the problem because both rows and columns are used in every loop iteration. Such orthogonal accesses mean that transformations such as loop interchange still leave plenty of room for improvement.
Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks. The goal is to maximize accesses to the data loaded into the cache before
the data are replaced. The code example below, which performs matrix multiplication, helps
motivate the optimization:
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }
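The example can also be sketched end to end in compilable C. The dimension N, block factor B, and the min_int helper below are illustrative assumptions, not the text's own choices; the straightforward triple-loop multiply is included as a correctness baseline for the blocked version:

```c
#include <assert.h>

#define N 16  /* matrix dimension -- illustrative */
#define B 4   /* block (tile) size; in practice chosen so the working tiles fit in cache */

static double y[N][N], z[N][N];

static int min_int(int a, int b) { return a < b ? a : b; }

/* Straightforward triple loop: computing each row of the result streams
   all N*N elements of z through the cache, so z is reloaded N times. */
static void matmul(double out[N][N]) {
    for (int i = 0; i < N; i = i + 1)
        for (int j = 0; j < N; j = j + 1) {
            double r = 0;
            for (int k = 0; k < N; k = k + 1)
                r = r + y[i][k] * z[k][j];
            out[i][j] = r;
        }
}

/* Blocked version: the two outer loops walk B-wide strips so that the
   B x B submatrix of z currently in use is reused from the cache before
   it is replaced. Partial sums are accumulated across kk strips. */
static void matmul_blocked(double out[N][N]) {
    for (int i = 0; i < N; i = i + 1)
        for (int j = 0; j < N; j = j + 1)
            out[i][j] = 0;
    for (int jj = 0; jj < N; jj = jj + B)
        for (int kk = 0; kk < N; kk = kk + B)
            for (int i = 0; i < N; i = i + 1)
                for (int j = jj; j < min_int(jj + B, N); j = j + 1) {
                    double r = 0;
                    for (int k = kk; k < min_int(kk + B, N); k = k + 1)
                        r = r + y[i][k] * z[k][j];
                    out[i][j] = out[i][j] + r;
                }
}
```

Both versions perform the same multiplications and additions; blocking only reorders them so that data already in the cache are reused before eviction.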