Optimizing OpenCL Kernels for the ARM Mali-T600 GPUs - GPU Pro: Advanced Rendering Techniques


we again step between consecutive memory locations in $B$, which means that the number of cache-line switches is only $\lambda_1/16$ for $B$. If we add the cache-line switches together for $A$ and $B$, we have $\lambda_0\lambda_1/16 + \lambda_1$ and $\lambda_0\lambda_1 + \lambda_1/16$ for the two versions, respectively. With $\lambda_0$ and $\lambda_1$ between 4 and 64, the first version will always need fewer cache-line switches than the latter.[21] In cases where we do have cache misses (e.g., due to thread divergence), this should improve the performance by reducing the concurrent cache needs within an iteration.
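As a quick sanity check of the comparison above, the following sketch evaluates both totals for every integer pair of local sizes between 4 and 64. The function names are hypothetical and the formulas are copied directly from the text; the difference between the two totals works out to $(15/16)\,\lambda_1(\lambda_0 - 1)$, which is positive whenever $\lambda_0 > 1$.

```python
def switches_first(l0, l1):
    """Total cache-line switches for A and B in the first version (formula from the text)."""
    return l0 * l1 / 16 + l1


def switches_second(l0, l1):
    """Total cache-line switches for A and B in the second version (formula from the text)."""
    return l0 * l1 + l1 / 16


# Local sizes considered in the text: 4 to 64 inclusive.
sizes = range(4, 65)

# True if the first version needs fewer switches for every (l0, l1) pair.
first_always_fewer = all(
    switches_first(l0, l1) < switches_second(l0, l1)
    for l0 in sizes
    for l1 in sizes
)
print("first version always fewer:", first_always_fewer)

# A few square work-group shapes for illustration.
for n in (4, 16, 64):
    print(n, switches_first(n, n), switches_second(n, n))
```

The exhaustive loop is cheap here (61 × 61 pairs), so it is simpler than reasoning about the boundary cases separately.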

7.5.7 L1 Cache Analysis of Blocked NT Kernel

For the blocked NT kernel, we can analyze the L1 cache utilization in the same way as for the NN kernel. We start by noting that work-item $(j, i)$, in iteration $k$, performs the memory accesses

$$A[2i, k],\quad A[2i+1, k],\quad B[2j, k],\quad B[2j+1, k],$$

and we see that we should again consider four iterations over $k$ to get full cache lines:

$$
\begin{aligned}
&A[2i, k+0],\quad A[2i+1, k+0],\quad B[2j, k+0],\quad B[2j+1, k+0];\\
&A[2i, k+1],\quad A[2i+1, k+1],\quad B[2j, k+1],\quad B[2j+1, k+1];\\
&A[2i, k+2],\quad A[2i+1, k+2],\quad B[2j, k+2],\quad B[2j+1, k+2];\\
&A[2i, k+3],\quad A[2i+1, k+3],\quad B[2j, k+3],\quad B[2j+1, k+3].
\end{aligned}
$$
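The access pattern above can be explored with a small sketch that maps each access to a cache-line index. It rests on assumptions that are not stated in this section: each element `A[r, k]` or `B[r, k]` is taken to be one 16-byte vector, cache lines are taken to be 64 bytes, and both matrices are stored row-major with a hypothetical row length `ROW_ELEMS` that keeps every row line-aligned. The helper names are likewise made up for illustration.

```python
VEC_BYTES = 16    # assumption: one element is a 16-byte vector (e.g. four floats)
LINE_BYTES = 64   # assumption: 64-byte L1 cache lines
ROW_ELEMS = 64    # hypothetical row length in elements; a multiple of 4 keeps rows line-aligned


def line_of(row, k):
    """Cache-line index of element [row, k] within a row-major matrix."""
    return (row * ROW_ELEMS + k) * VEC_BYTES // LINE_BYTES


def lines_touched(work_items, k_start=0, iterations=4):
    """Set of (matrix, cache-line) pairs touched by the given (j, i) work-items."""
    lines = set()
    for j, i in work_items:
        for k in range(k_start, k_start + iterations):
            lines.add(("A", line_of(2 * i, k)))
            lines.add(("A", line_of(2 * i + 1, k)))
            lines.add(("B", line_of(2 * j, k)))
            lines.add(("B", line_of(2 * j + 1, k)))
    return lines


# One work-item over four iterations: two rows of A and two rows of B.
single = lines_touched([(0, 0)])
print(len(single))

# A hypothetical work-group with local sizes l0 = 4 (over j) and l1 = 8 (over i).
l0, l1 = 4, 8
group = lines_touched([(j, i) for i in range(l1) for j in range(l0)])
print(len(group))
```

Under these assumptions, a single iteration touches the same lines as four iterations but uses only a quarter of each line, which is why four iterations are grouped together. Passing all $(j, i)$ pairs of a work-group gives that group's footprint for the same four iterations.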

During execution of the first four iterations, the first work-group, with its $\lambda_0\lambda_1$ work-items, accesses

$$
\bigcup_{k=0}^{3}\;\bigcup_{i=0}^{\lambda_1-1}\;\bigcup_{j=0}^{\lambda_0-1}
\bigl\{\, A[2i,k],\; A[2i+1,k],\; B[2j,k],\; B[2j+1,k] \,\bigr\}
$$

and the first set of $M$ simultaneous work-groups accesses

$$
\bigcup_{k=0}^{3}\;\bigcup_{m=0}^{M-1}\;\bigcup_{i=0}^{\lambda_1-1}\;\bigcup_{j=m\lambda_0}^{m\lambda_0+\lambda_0-1}
\bigl\{\, A[2i,k],\; A[2i+1,k],\; B[2j,k],\; B[2j+1,k] \,\bigr\},
$$

[21] The roles of $\lambda_0$ and $\lambda_1$ are interchanged between the two versions, but the first one is still always better.
