we again step between consecutive memory locations in $B$, which means that
the number of cache-line switches is only $\lambda_1/16$ for $B$. If we add the cache-line
switches together for $A$ and $B$, we have $\lambda_0 \lambda_1/16 + \lambda_1$ and $\lambda_0 \lambda_1 + \lambda_1/16$ for the
two versions, respectively. With $\lambda_0$ and $\lambda_1$ between 4 and 64, the first version
will always need fewer cache-line switches than the second.[21] In cases where we
do have cache misses (e.g., due to thread divergence), this should improve
performance by reducing the concurrent cache needs within an iteration.
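The comparison of the two totals can be checked by brute force. The following is a minimal sketch, not part of the original analysis: the function names are hypothetical, and it simply evaluates the two expressions stated above for every integer pair of local sizes between 4 and 64.

```python
from fractions import Fraction

def switches_first(l0, l1):
    """Total cache-line switches for A and B in the first version: l0*l1/16 + l1."""
    return Fraction(l0 * l1, 16) + l1

def switches_second(l0, l1):
    """Total cache-line switches for A and B in the second version: l0*l1 + l1/16."""
    return l0 * l1 + Fraction(l1, 16)

def first_always_better(lo=4, hi=64):
    """Check that the first version needs fewer switches for every (l0, l1) in [lo, hi]."""
    return all(
        switches_first(l0, l1) < switches_second(l0, l1)
        for l0 in range(lo, hi + 1)
        for l1 in range(lo, hi + 1)
    )

# Example: local sizes l0 = 4, l1 = 16 (illustrative values only).
print(switches_first(4, 16), switches_second(4, 16))
print(first_always_better())
```

The difference between the two totals is $\tfrac{15}{16}\lambda_1(\lambda_0 - 1)$, which is positive whenever $\lambda_0 > 1$, so the check succeeds over the whole range.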
7.5.7 L1 Cache Analysis of Blocked NT Kernel
For the blocked NT kernel, we can analyze the L1 cache utilization in the same
way as for the NN kernel. We start by noting that work-item $(j, i)$, in iteration
$k$, performs the memory accesses

$$A[2i, k],\quad A[2i+1, k],\quad B[2j, k],\quad B[2j+1, k],$$

and we see that we should again consider four iterations over $k$ to get full cache
lines:

$$
\begin{aligned}
&A[2i, k+0],\quad A[2i+1, k+0],\quad B[2j, k+0],\quad B[2j+1, k+0];\\
&A[2i, k+1],\quad A[2i+1, k+1],\quad B[2j, k+1],\quad B[2j+1, k+1];\\
&A[2i, k+2],\quad A[2i+1, k+2],\quad B[2j, k+2],\quad B[2j+1, k+2];\\
&A[2i, k+3],\quad A[2i+1, k+3],\quad B[2j, k+3],\quad B[2j+1, k+3].
\end{aligned}
$$
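The access pattern of a single work-item can be enumerated to see why four iterations are the natural unit. This is a minimal sketch under assumptions not stated in this section: each matrix element is taken to be 16 bytes (for example, a four-component float vector), a cache line is taken to be 64 bytes, and rows are taken to be stored contiguously along $k$, so four consecutive $k$ values of one row fill exactly one cache line. The helper names are hypothetical.

```python
ELEMS_PER_LINE = 4  # assumption: 64-byte cache line / 16-byte element

def accesses(j, i, k):
    """Elements read by work-item (j, i) in iteration k of the blocked NT kernel."""
    return [("A", 2 * i, k), ("A", 2 * i + 1, k),
            ("B", 2 * j, k), ("B", 2 * j + 1, k)]

def cache_lines(j, i, k_start, iterations=4):
    """Map each touched cache line (matrix, row, line index) to the k values read from it."""
    lines = {}
    for k in range(k_start, k_start + iterations):
        for matrix, row, col in accesses(j, i, k):
            key = (matrix, row, col // ELEMS_PER_LINE)
            lines.setdefault(key, set()).add(col)
    return lines

# Illustrative work-item (j, i) = (1, 2), iterations k = 0..3.
lines = cache_lines(j=1, i=2, k_start=0)
for key in sorted(lines):
    print(key, sorted(lines[key]))
```

Under these assumptions the work-item touches four cache lines over the four iterations (two rows of $A$ and two rows of $B$) and reads every element of each of them.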
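During execution of the first four iterations, the first work-group, with its
$\lambda_0 \lambda_1$ work-items, accesses

$$
\bigcup_{k=0}^{3} \left(
\begin{aligned}
&\bigcup_{i=0}^{\lambda_1-1} \bigcup_{j=0}^{\lambda_0-1} A[2i, k],\quad
 \bigcup_{i=0}^{\lambda_1-1} \bigcup_{j=0}^{\lambda_0-1} A[2i+1, k],\\
&\bigcup_{i=0}^{\lambda_1-1} \bigcup_{j=0}^{\lambda_0-1} B[2j, k],\quad
 \bigcup_{i=0}^{\lambda_1-1} \bigcup_{j=0}^{\lambda_0-1} B[2j+1, k]
\end{aligned}
\right)
$$

and the first set of $M$ simultaneous work-groups accesses

$$
\bigcup_{k=0}^{3} \left(
\begin{aligned}
&\bigcup_{m=0}^{M-1} \bigcup_{i=0}^{\lambda_1-1} \bigcup_{j=m\lambda_0}^{m\lambda_0+\lambda_0-1} A[2i, k],\quad
 \bigcup_{m=0}^{M-1} \bigcup_{i=0}^{\lambda_1-1} \bigcup_{j=m\lambda_0}^{m\lambda_0+\lambda_0-1} A[2i+1, k],\\
&\bigcup_{m=0}^{M-1} \bigcup_{i=0}^{\lambda_1-1} \bigcup_{j=m\lambda_0}^{m\lambda_0+\lambda_0-1} B[2j, k],\quad
 \bigcup_{m=0}^{M-1} \bigcup_{i=0}^{\lambda_1-1} \bigcup_{j=m\lambda_0}^{m\lambda_0+\lambda_0-1} B[2j+1, k]
\end{aligned}
\right),
$$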
[21] The roles of $\lambda_0$ and $\lambda_1$ are interchanged between the two versions, but the first one is still
always better.