λ1                  1      2      4      8     16     32     64    128
Cache lines (NN)  513    258    132     72     48     48     72    132
L1 fraction (NN)  2.0    1.0   0.52   0.28   0.19   0.19   0.28   0.52
Cache lines (NT)  258    132     72     48     48     72    132    258
L1 fraction (NT)  1.0   0.52   0.28   0.19   0.19   0.28   0.52    1.0

Table 7.2. L1 cache utilization for the 1 × 4 × 4 blocked NN and the 2 × 4 × 2 blocked
NT kernels. We note that if we want to choose λ0 = 4, we are restricted to λ1 ≤ 32.
where m was incremented as an outer index to both i and j, as we create all work-
items in the first work-group before creating the first work-item in the second
work-group. We again share the accesses to A, and these four iterations over k
will need 2λ1 cache lines from A and 2λ0 cache lines from B. As before, we
have M = 128/(λ0 λ1), giving a total L1 usage of

    2λ1 + 2λ0 · 128/(λ0 λ1) = 2λ1 + 256/λ1.
If we compare with the result of the NN implementation, we get the L1 uti-
lization fractions shown in Table 7.2. Comparing these results with the previous
ones, we see that while we had a preference for λ1 = 16 or λ1 = 32 for the
1 × 4 × 4 blocked (NN) version, the 2 × 4 × 2 blocked (NT) implementation
works better with smaller work-groups.
7.5.8 L1 Cache Blocking
We saw above that the L1 cache utilization determined our robustness against
thread divergence, but unless we interfere with thread scheduling in some way,
every program will experience thread divergence. For large enough matrices,
this will always lead to performance degradations in the kernels we have discussed
so far. Our strategy to get around this issue is to introduce yet another level of
blocking and to rewrite the algorithm with this additional level of block-matrix
multiplication.
As a means of relating this level of blocking to the discussion about register
blocking, we now introduce a much larger ΔK, so that we have two: the ΔK_reg
introduced previously, and the new ΔK_cache. After each set of ΔK_cache iterations
in the loop, we reassemble all work-items in the work-group to ensure that no
thread divergence appears within the work-group.
Relating the change to the actual code, we insert a barrier operation after
every ΔK_cache iterations in the loop. As this only limits thread divergence within
a work-group, we still have to take into account the divergence between work-
groups, which limits L1 cache sharing between different work-groups. We should
therefore expect this method to work best when there is only one simultaneously
active work-group on each core. The full kernel is shown in Listing 7.13.