the same value of b as the last iteration. Ignoring potential conflict misses, the
misses due to b will be for b[j+1][0] accesses when i = 0, and also the first access
to b[j][0] when j = 0. Since j goes from 0 to 99 when i = 0, accesses to b lead to
100 + 1, or 101 misses.
Thus, this loop will miss the data cache approximately 150 times for a plus
101 times for b , or 251 misses.
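The miss arithmetic above can be checked with a small sketch. The block size (16 bytes) and element size (8-byte doubles) come from the surrounding discussion; the function names are ours, introduced only for illustration.

```c
/* Misses for a[3][100], written sequentially: 8-byte doubles in
   16-byte blocks give 2 elements per block, so one miss per pair. */
int a_misses(void) {
    int rows = 3, cols = 100, elems_per_block = 16 / 8;
    return rows * cols / elems_per_block; /* 150 */
}

/* Misses for b: when i == 0, each of the 100 iterations touches a
   new b[j+1][0], plus one miss for the very first b[0][0] access.
   Later values of i reuse the same column of b. */
int b_misses(void) {
    return 100 + 1; /* 101 */
}
```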
To simplify our optimization, we will not worry about prefetching the first
accesses of the loop. These may already be in the cache, or we will pay the miss
penalty of the first few elements of a or b. Nor will we worry about suppressing
the prefetches at the end of the loop that try to prefetch beyond the end of a
(a[i][100], …, a[i][106]) and the end of b (b[101][0], …, b[107][0]). If these were
faulting prefetches, we could not take this luxury. Let's assume that the miss penalty
is so large that we need to start prefetching at least, say, seven iterations in
advance. (Stated alternatively, we assume prefetching has no benefit until the
eighth iteration.) The prefetch calls are the changes to the code above needed to
add prefetching.
for (j = 0; j < 100; j = j+1) {
    prefetch(b[j+7][0]);     /* b(j,0) for 7 iterations later */
    prefetch(a[0][j+7]);     /* a(0,j) for 7 iterations later */
    a[0][j] = b[j][0] * b[j+1][0];
}
for (i = 1; i < 3; i = i+1)
    for (j = 0; j < 100; j = j+1) {
        prefetch(a[i][j+7]); /* a(i,j) for 7 iterations later */
        a[i][j] = b[j][0] * b[j+1][0];
    }
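As a concrete sketch, the abstract prefetch() can be rendered with GCC/Clang's __builtin_prefetch intrinsic. The array shapes are assumed from the text (a is 3×100, b is 101×3, with only column 0 of b used), and the out-of-range prefetches discussed above are suppressed with bounds checks here only so the C stays well-defined; real nonfaulting prefetches would simply be issued and dropped by the hardware.

```c
#define PF 7 /* prefetch distance: seven iterations ahead */

static double a[3][100];
static double b[101][3]; /* shapes assumed; only b[...][0] is used */

void init_b(void) {
    for (int j = 0; j < 101; j++)
        b[j][0] = (double)j;
}

void prefetched_loops(void) {
    for (int j = 0; j < 100; j++) {
        if (j + PF <= 100)
            __builtin_prefetch(&b[j + PF][0]); /* b(j,0), 7 iterations later */
        if (j + PF < 100)
            __builtin_prefetch(&a[0][j + PF]); /* a(0,j), 7 iterations later */
        a[0][j] = b[j][0] * b[j + 1][0];
    }
    for (int i = 1; i < 3; i++)
        for (int j = 0; j < 100; j++) {
            if (j + PF < 100)
                __builtin_prefetch(&a[i][j + PF]); /* a(i,j), 7 iterations later */
            a[i][j] = b[j][0] * b[j + 1][0];
        }
}
```

Because __builtin_prefetch is only a hint, removing it leaves the results unchanged; it affects timing, not correctness.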
This revised code prefetches a[i][7] through a[i][99] and b[7][0] through
b[100][0], reducing the number of nonprefetched misses to
■ 7 misses for elements b[0][0], b[1][0], …, b[6][0] in the first loop
■ 4 misses (⌈7/2⌉) for elements a[0][0], a[0][1], …, a[0][6] in the first loop (spatial locality reduces misses to 1 per 16-byte cache block)
■ 4 misses (⌈7/2⌉) for elements a[1][0], a[1][1], …, a[1][6] in the second loop
■ 4 misses (⌈7/2⌉) for elements a[2][0], a[2][1], …, a[2][6] in the second loop
or a total of 19 nonprefetched misses. The cost of avoiding 232 cache misses is
executing 400 prefetch instructions, likely a good trade-off.
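This bookkeeping can be double-checked with a short sketch; the ceiling of 7/2 is computed as (7 + 1) / 2 in integer arithmetic, and the function names are ours.

```c
/* Nonprefetched misses: 7 for b[0][0]..b[6][0], plus ceil(7/2) = 4
   for a[i][0]..a[i][6] in each of the three rows of a. */
int nonprefetched_misses(void) {
    int b_cold = 7;
    int a_cold_per_row = (7 + 1) / 2; /* ceiling of 7/2 */
    return b_cold + 3 * a_cold_per_row; /* 19 */
}

/* Prefetch instructions executed: 2 per iteration in the first
   100-iteration loop, plus 1 per iteration in the second loop,
   which runs 100 iterations for each of i = 1 and i = 2. */
int prefetches_executed(void) {
    return 2 * 100 + 1 * 100 * 2; /* 400 */
}
```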
Example
Calculate the time saved in the example above. Ignore instruction cache misses
and assume there are no conflict or capacity misses in the data cache. Assume
that prefetches can overlap with each other and with cache misses, thereby
transferring at the maximum memory bandwidth. Here are the key loop times: