the same value of b as the last iteration. Ignoring potential conflict misses, the
misses due to b will be for b[j+1][0] accesses when i = 0, and also the first access
to b[j][0] when j = 0. Since j goes from 0 to 99 when i = 0, accesses to b lead to
100 + 1, or 101 misses.
Thus, this loop will miss the data cache approximately 150 times for a plus
101 times for b , or 251 misses.
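The miss arithmetic above can be checked with a small sketch. The block size (16 bytes) and element size (8-byte doubles) come from the surrounding discussion; the function names are ours, introduced only for illustration.

```c
/* Misses for a[3][100], written sequentially: 8-byte doubles in
   16-byte blocks give 2 elements per block, so one miss per pair. */
int a_misses(void) {
    int rows = 3, cols = 100, elems_per_block = 16 / 8;
    return rows * cols / elems_per_block; /* 150 */
}

/* Misses for b: when i == 0, each of the 100 iterations touches a
   new b[j+1][0], plus one miss for the very first b[0][0] access.
   Later values of i reuse the same column of b. */
int b_misses(void) {
    return 100 + 1; /* 101 */
}
```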
To simplify our optimization, we will not worry about prefetching the first
accesses of the loop. These may already be in the cache, or we will pay the miss
penalty of the first few elements of a or b. Nor will we worry about suppressing
the prefetches at the end of the loop that try to prefetch beyond the end of a
(a[i][100], …, a[i][106]) and the end of b (b[101][0], …, b[107][0]). If these were
faulting prefetches, we could not take this luxury. Let's assume that the miss penalty
is so large that we need to start prefetching at least, say, seven iterations in
advance. (Stated alternatively, we assume prefetching has no benefit until the
eighth iteration.) The prefetch calls are the changes to the code above needed to
add prefetching.
for (j = 0; j < 100; j = j+1) {
    prefetch(b[j+7][0]);     /* b(j,0) for 7 iterations later */
    prefetch(a[0][j+7]);     /* a(0,j) for 7 iterations later */
    a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1) {
        prefetch(a[i][j+7]); /* a(i,j) for 7 iterations later */
        a[i][j] = b[j][0] * b[j+1][0];
    }
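As a concrete sketch, the abstract prefetch() can be rendered with GCC/Clang's __builtin_prefetch intrinsic. The array shapes are assumed from the text (a is 3×100, b is 101×3, with only column 0 of b used), and the out-of-range prefetches discussed above are suppressed with bounds checks here only so the C stays well-defined; real nonfaulting prefetches would simply be issued and dropped by the hardware.

```c
#define PF 7 /* prefetch distance: seven iterations ahead */

static double a[3][100];
static double b[101][3]; /* shapes assumed; only b[...][0] is used */

void init_b(void) {
    for (int j = 0; j < 101; j++)
        b[j][0] = (double)j;
}

void prefetched_loops(void) {
    for (int j = 0; j < 100; j++) {
        if (j + PF <= 100)
            __builtin_prefetch(&b[j + PF][0]); /* b(j,0), 7 iterations later */
        if (j + PF < 100)
            __builtin_prefetch(&a[0][j + PF]); /* a(0,j), 7 iterations later */
        a[0][j] = b[j][0] * b[j + 1][0];
    }
    for (int i = 1; i < 3; i++)
        for (int j = 0; j < 100; j++) {
            if (j + PF < 100)
                __builtin_prefetch(&a[i][j + PF]); /* a(i,j), 7 iterations later */
            a[i][j] = b[j][0] * b[j + 1][0];
        }
}
```

Because __builtin_prefetch is only a hint, removing it leaves the results unchanged; it affects timing, not correctness.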
This revised code prefetches a[i][7] through a[i][99] and b[7][0] through
b[100][0], reducing the number of nonprefetched misses to
■ 7 misses for elements b[0][0], b[1][0], …, b[6][0] in the first loop
■ 4 misses (⌈7/2⌉) for elements a[0][0], a[0][1], …, a[0][6] in the first loop (spatial locality reduces misses to 1 per 16-byte cache block)
■ 4 misses (⌈7/2⌉) for elements a[1][0], a[1][1], …, a[1][6] in the second loop
■ 4 misses (⌈7/2⌉) for elements a[2][0], a[2][1], …, a[2][6] in the second loop
or a total of 19 nonprefetched misses. The cost of avoiding 232 cache misses is
executing 400 prefetch instructions, likely a good trade-off.
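This bookkeeping can be double-checked with a short sketch; the ceiling of 7/2 is computed as (7 + 1) / 2 in integer arithmetic, and the function names are ours.

```c
/* Nonprefetched misses: 7 for b[0][0]..b[6][0], plus ceil(7/2) = 4
   for a[i][0]..a[i][6] in each of the three rows of a. */
int nonprefetched_misses(void) {
    int b_cold = 7;
    int a_cold_per_row = (7 + 1) / 2; /* ceiling of 7/2 */
    return b_cold + 3 * a_cold_per_row; /* 19 */
}

/* Prefetch instructions executed: 2 per iteration in the first
   100-iteration loop, plus 1 per iteration in the second loop,
   which runs 100 iterations for each of i = 1 and i = 2. */
int prefetches_executed(void) {
    return 2 * 100 + 1 * 100 * 2; /* 400 */
}
```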
Example
Calculate the time saved in the example above. Ignore instruction cache misses
and assume there are no conflict or capacity misses in the data cache. Assume
that prefetches can overlap with each other and with cache misses, thereby
transferring at the maximum memory bandwidth. Here are the key loop times: