Hardware Reference
In-Depth Information
for (i = 0; i < N; i = i+1)
for (j = 0; j < N; j = j+1)
{r = 0;
for (k = 0; k < N; k = k + 1)
r = r + y[i][k]*z[k][j];
x[i][j] = r;
};
The two inner loops read all
N
-by-
N
elements of
z
, read the same
N
elements in a row of
y
re-
the three arrays. A dark shade indicates a recent access, a light shade indicates an older access,
and white means not yet accessed.
FIGURE 2.8
A snapshot of the three arrays
x
,
y
, and
z
when
N
= 6 and
i
= 1
. The age of
accesses to the array elements is indicated by shade: white means not yet touched, light
means older accesses, and dark means newer accesses. Compared to
Figure 2.9
, elements
of
y
and
z
are read repeatedly to calculate new elements of
x
. The variables
i
,
j
, and
k
are
shown along the rows or columns used to access the arrays.
The number of capacity misses clearly depends on
N
and the size of the cache. If it can hold
all three
N
-by-
N
matrices, then all is well, provided there are no cache conflicts. If the cache can
hold one
N-by-N
matrix and one row of
N
, then at least the
i
ith row of
y
and the array
z
may stay
in the cache. Less than that and misses may occur for both
x
and
z
. In the worst case, there
would be 2
N
3
+
N
2
memory words accessed for
N
3
operations.
To ensure that the elements being accessed can it in the cache, the original code is changed
to compute on a submatrix of size
B
by
B
. Two inner loops now compute in steps of size
B
rather
than the full length of
x
and
z
.
B
is called the
blocking factor
. (Assume
x
is initialized to zero.)
/* After */
for (jj = 0; jj < N; jj = jj+B)
for (kk = 0; kk < N; kk = kk+B)
for (i = 0; i < N; i = i+1)
for (j = jj; j < min(jj+B,N); j = j+1)
{r = 0;
for (k = kk; k < min(kk+B,N); k = k + 1)
r = r + y[i][k]*z[k][j];
x[i][j] = x[i][j] + r;
Search WWH ::
Custom Search