Hardware Reference
In-Depth Information
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k + 1)
r = r + y[i][k]*z[k][j];
x[i][j] = r;
};
The two inner loops read all N -by- N elements of z , read the same N elements in a row of y re-
peatedly, and write one row of N elements of x . Figure 2.8 gives a snapshot of the accesses to
the three arrays. A dark shade indicates a recent access, a light shade indicates an older access,
and white means not yet accessed.
FIGURE 2.8 A snapshot of the three arrays x , y , and z when N = 6 and i = 1 . The age of
accesses to the array elements is indicated by shade: white means not yet touched, light
means older accesses, and dark means newer accesses. Compared to Figure 2.9 , elements
of y and z are read repeatedly to calculate new elements of x . The variables i , j , and k are
shown along the rows or columns used to access the arrays.
The number of capacity misses clearly depends on N and the size of the cache. If it can hold
all three N -by- N matrices, then all is well, provided there are no cache conflicts. If the cache can
hold one N-by-N matrix and one row of N , then at least the i ith row of y and the array z may stay
in the cache. Less than that and misses may occur for both x and z . In the worst case, there
would be 2 N 3 + N 2 memory words accessed for N 3 operations.
To ensure that the elements being accessed can it in the cache, the original code is changed
to compute on a submatrix of size B by B . Two inner loops now compute in steps of size B rather
than the full length of x and z . B is called the blocking factor . (Assume x is initialized to zero.)
/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j < min(jj+B,N); j = j+1)
{r = 0;
for (k = kk; k < min(kk+B,N); k = k + 1)
r = r + y[i][k]*z[k][j];
x[i][j] = x[i][j] + r;
 
Search WWH ::




Custom Search