λ1                  1      2      4      8     16     32     64    128
Cache lines (NN)  513    258    132     72     48     48     72    132
L1 fraction (NN)  2.0    1.0   0.52   0.28   0.19   0.19   0.28   0.52
Cache lines (NT)  258    132     72     48     48     72    132    258
L1 fraction (NT)  1.0   0.52   0.28   0.19   0.19   0.28   0.52    1.0

Table 7.2. L1 cache utilization for the 1 × 4 × 4 blocked NN and the 2 × 4 × 2 blocked
NT kernels. We note that if we want to choose λ0 = 4, we are restricted to λ1 ≤ 32.
where m was incremented as an outer index to both i and j, as we create all work-
items in the first work-group before creating the first work-item in the second
work-group. We again share the accesses to A, and these four iterations over k
will need 2λ1 cache lines from A and 2λ0 cache lines from B. As before, we
have M = 128/(λ0 λ1), giving a total L1 usage of

    2λ1 + 2λ0 · 128/(λ0 λ1) = 2λ1 + 256/λ1.
If we compare with the result of the NN implementation, we get the L1 uti-
lization fractions shown in Table 7.2. Comparing these results with the previous
ones, we see that while we had a preference for λ1 = 16 or λ1 = 32 for the
1 × 4 × 4 blocked (NN) version, the 2 × 4 × 2 blocked (NT) implementation
works better with smaller work-groups.
7.5.8 L1 Cache Blocking
We saw above that the L1 cache utilization determined our robustness against
thread divergence, but unless we interfere with thread scheduling in some way,
every program will experience thread divergence. For large enough matrices,
this will always lead to performance degradations in the kernels we have discussed
so far. Our strategy to get around this issue is to introduce yet another level of
blocking and to rewrite the algorithm with this additional level of block-matrix
multiplication.
As a means of relating this level of blocking to the discussion about register
blocking, we now introduce a much larger ΔK, so that we have two: the ΔK_reg
introduced previously, and the new ΔK_cache. After each set of ΔK_cache iterations
in the loop, we reassemble all work-items in the work-group to ensure that no
thread divergence appears within the work-group.
Relating the change to the actual code, we insert a barrier operation after
every ΔK_cache iterations in the loop. As this only limits thread divergence within
a work-group, we still have to take into account the divergence between work-
groups, which limits L1 cache sharing between different work-groups. We should
therefore expect this method to work best when there is only one simultaneously
active work-group on each core. The full kernel is shown in Listing 7.13.