typically implemented by marking the pages as requiring nonmerging write through by the
caches.
Eighth Optimization: Compiler Optimizations To Reduce Miss Rate
Thus far, our techniques have required changing the hardware. This next technique reduces
miss rates without any hardware changes.
This magical reduction comes from optimized software—the hardware designer's favorite
solution! The increasing performance gap between processors and main memory has inspired
compiler writers to scrutinize the memory hierarchy to see if compile-time optimizations
can improve performance. Once again, research is split between improvements in instruction
misses and improvements in data misses. The optimizations presented below are found in
many modern compilers.
Loop Interchange
Some programs have nested loops that access data in memory in nonsequential order. Simply
exchanging the nesting of the loops can make the code access the data in the order in which
they are stored. Assuming the arrays do not fit in the cache, this technique reduces misses
by improving spatial locality; reordering maximizes use of data in a cache block before they
are discarded. For example, if x is a two-dimensional array of size [5000,100] allocated so that
x[i,j] and x[i,j+1] are adjacent (an order called row major, since the array is laid out by rows),
then the two pieces of code below show how the accesses can be optimized:
/* Before */
for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
        x[i][j] = 2 * x[i][j];

/* After */
for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
        x[i][j] = 2 * x[i][j];
The original code would skip through memory in strides of 100 words, while the revised
version accesses all the words in one cache block before going to the next block. This optimization improves cache performance without affecting the number of instructions executed.
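The interchange changes only the traversal order, never the arithmetic, so both versions must produce identical results. A minimal compilable check (with the array shrunk from the text's 5000×100 purely for brevity; the function names are illustrative):

```c
#include <assert.h>

#define ROWS 50
#define COLS 10

/* Before: column-order traversal. Successive accesses are COLS elements
   apart in memory, so each touch may land in a different cache block. */
static void double_col_order(double x[ROWS][COLS]) {
    for (int j = 0; j < COLS; j = j + 1)
        for (int i = 0; i < ROWS; i = i + 1)
            x[i][j] = 2 * x[i][j];
}

/* After: row-order traversal. Successive accesses are adjacent, so every
   word of a cache block is used before the block is discarded. */
static void double_row_order(double x[ROWS][COLS]) {
    for (int i = 0; i < ROWS; i = i + 1)
        for (int j = 0; j < COLS; j = j + 1)
            x[i][j] = 2 * x[i][j];
}
```

Both functions execute the same number of multiplications; only the memory access pattern, and hence the miss rate on arrays larger than the cache, differs.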
Blocking
This optimization improves temporal locality to reduce misses. We are again dealing with
multiple arrays, with some arrays accessed by rows and some by columns. Storing the arrays row by row (row major order) or column by column (column major order) does not solve the problem because both rows and columns are used in every loop iteration. Such orthogonal accesses mean that transformations such as loop interchange still leave plenty of room for improvement.
Instead of operating on entire rows or columns of an array, blocked algorithms operate on submatrices or blocks. The goal is to maximize accesses to the data loaded into the cache before
the data are replaced. The code example below, which performs matrix multiplication, helps
motivate the optimization:
/* Before */
for (i = 0; i < N; i = i+1)
    for (j = 0; j < N; j = j+1) {
        r = 0;
        for (k = 0; k < N; k = k+1)
            r = r + y[i][k] * z[k][j];
        x[i][j] = r;
    }
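The example can also be sketched end to end in compilable C. The dimension N, block factor B, and the min_int helper below are illustrative assumptions, not the text's own choices; the straightforward triple-loop multiply is included as a correctness baseline for the blocked version:

```c
#include <assert.h>

#define N 16  /* matrix dimension -- illustrative */
#define B 4   /* block (tile) size; in practice chosen so the working tiles fit in cache */

static double y[N][N], z[N][N];

static int min_int(int a, int b) { return a < b ? a : b; }

/* Straightforward triple loop: computing each row of the result streams
   all N*N elements of z through the cache, so z is reloaded N times. */
static void matmul(double out[N][N]) {
    for (int i = 0; i < N; i = i + 1)
        for (int j = 0; j < N; j = j + 1) {
            double r = 0;
            for (int k = 0; k < N; k = k + 1)
                r = r + y[i][k] * z[k][j];
            out[i][j] = r;
        }
}

/* Blocked version: the two outer loops walk B-wide strips so that the
   B x B submatrix of z currently in use is reused from the cache before
   it is replaced. Partial sums are accumulated across kk strips. */
static void matmul_blocked(double out[N][N]) {
    for (int i = 0; i < N; i = i + 1)
        for (int j = 0; j < N; j = j + 1)
            out[i][j] = 0;
    for (int jj = 0; jj < N; jj = jj + B)
        for (int kk = 0; kk < N; kk = kk + B)
            for (int i = 0; i < N; i = i + 1)
                for (int j = jj; j < min_int(jj + B, N); j = j + 1) {
                    double r = 0;
                    for (int k = kk; k < min_int(kk + B, N); k = k + 1)
                        r = r + y[i][k] * z[k][j];
                    out[i][j] = out[i][j] + r;
                }
}
```

Both versions perform the same multiplications and additions; blocking only reorders them so that data already in the cache are reused before eviction.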