Graphics Reference
In-Depth Information
we can qualitatively say that the longer the program runs, the stronger divergence
we see. This implies that the benefits of a low L1 cache utilization fraction have
a stronger impact on performance for large matrix sizes.
The innermost index. Now that we have discussed cache usage, we will return to
one aspect of the NN variant that we did not discuss before, namely the choice
to use j as the innermost index in the sequencing of operations. We have chosen
the order by using global_id(0) for the index j and global_id(1) for the index i
in our CL C kernels. The argument we will provide here holds only for the NN
versions, and is not applicable for the NT versions.
If we again consider the memory read sequence for the scalar NN variant, and
only look at a single iteration and a single work-group, we see
λ 1 1
A [ i,k ] , λ 1 1
B [ k,j ]
,
i =0
,
j =0
,
i =0
,
j =0
λ 0 1
λ 0 1
= λ 1 1
, λ 0 1
B [ k,j ]
,
i =0
,
j =0
A [ i,k ]
×
λ 0
×
λ 1 ,
but we could instead have swapped the roles of the global_id s and gotten
λ 1 1
A [ i,k ] , λ 1 1
B [ k,j ]
,
j =0
,
i =0
,
j =0
,
i =0
λ 0 1
λ 0 1
λ 0 1
A [ i,k ]
λ 1 , λ 1 1
.
,
i =0
,
j =0
×
B [ k,j ]
×
λ 0
We will now look for the number of cache-line switches in the execution of
those reads. For the former case, the reads from A are described by
,
λ 1 1
i =0
×
λ 0 , which switches cache line λ 1 times (and between those switches, we read
from the same matrix element λ 0 times). For different values of i we read from
different rows of the matrix, and separate rows are always on separate cache
lines. 20 Staying with the first version, but looking at B ,weseethatweswitch
matrix element λ 0 λ 1 times, as every read is from a different element than the
previous read, and we perform λ 0 λ 1 reads. We note, however, that we switch
between consecutive elements in memory, which means that we only have λ 0 λ 1 / 16
cache line switches, as we can fit 16 matrix elements into a cache line. For the
second version, with the indices in the other order, analogous considerations
show that we switch matrix element λ 0 λ 1 times for A and λ 1 times for B , but
A [ i,k ]
20 Except for very small matrices, of course, but those cases do not require cache analysis
anyway.
Search WWH ::




Custom Search