This notation allows us to write a single iteration over k as

    C[i, j] += A[i, k].x × B[4k + 0, j] + A[i, k].y × B[4k + 1, j]
             + A[i, k].z × B[4k + 2, j] + A[i, k].w × B[4k + 3, j],

for i = 32n, …, 32n + 31 and j = 4m, …, 4m + 3.
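As a concrete illustration, one such iteration over k can be sketched in plain C (the real kernel would be OpenCL C, where float4 and mad() are built in). The names iterate_k and f4_mad and the dimension macros KDIM4 and NDIM4 are our own illustrative assumptions, not the actual kernel:

```c
#include <assert.h>

#define KDIM4 2  /* K/4: float4 elements per row of A (illustrative size) */
#define NDIM4 4  /* N/4: float4 columns of B and C (illustrative size)    */

typedef struct { float x, y, z, w; } float4;

/* acc += s * b componentwise, as an OpenCL mad() on float4 would do. */
static float4 f4_mad(float s, float4 b, float4 acc) {
    acc.x += s * b.x; acc.y += s * b.y;
    acc.z += s * b.z; acc.w += s * b.w;
    return acc;
}

/* One iteration over k of the update above, for the 32x4 block of float4
 * elements of C owned by work-group (n, m). A[i][k] packs the four scalar
 * k-values 4k+0 .. 4k+3 of row i into one float4. */
void iterate_k(int n, int m, int k,
               float4 A[][KDIM4], float4 B[][NDIM4], float4 C[][NDIM4])
{
    for (int i = 32 * n; i <= 32 * n + 31; ++i) {
        float4 a = A[i][k];
        for (int j = 4 * m; j <= 4 * m + 3; ++j) {
            float4 c = C[i][j];
            c = f4_mad(a.x, B[4 * k + 0][j], c);
            c = f4_mad(a.y, B[4 * k + 1][j], c);
            c = f4_mad(a.z, B[4 * k + 2][j], c);
            c = f4_mad(a.w, B[4 * k + 3][j], c);
            C[i][j] = c;
        }
    }
}
```

Keeping the four scalar-times-float4 multiply-adds explicit makes the four rows of B touched by one iteration visible as four separate memory accesses, which is exactly what the cache-line analysis below counts.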
As a cache line has space for four float4 elements, we see that the reads from
A read the first quarter of 32 consecutive cache lines and the reads from B read
four full cache lines. To get full cache lines instead, we consider four consecutive
iterations in k together, and we see that those four iterations read 32 full cache
lines from A and 16 full cache lines from B . For the moment, we restrict ourselves
to considering a single work-group, and we note that these cache lines will never
be reused by later operations in the same work-group. We have now arrived at
our conclusion for the L1 cache requirements of the loop. If our L1 cache has
enough space for 48 cache lines, then we will never read the same value into the
L1 cache twice while executing the loop for all work-items in a work-group, as all
subsequent uses will be able to reuse the value that is stored in the cache.
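The counting in this paragraph is easy to verify mechanically. The following sketch reproduces the 32 + 16 = 48 figure under the stated assumption of cache lines holding four 16-byte float4 elements (so 64-byte lines); the function names are ours, for illustration only:

```c
#include <assert.h>

/* Cache-line bookkeeping for four consecutive iterations over k,
 * assuming 64-byte cache lines and 16-byte float4 elements
 * (four float4 per line). */
enum { LINE_BYTES = 64, FLOAT4_BYTES = 16 };

/* A: 32 rows, each reading the four consecutive float4 elements
 * k .. k+3 of its row -> 32 full cache lines. */
int a_lines_per_four_iterations(void) {
    return 32 * 4 * FLOAT4_BYTES / LINE_BYTES;
}

/* B: 16 consecutive scalar rows (4k .. 4k+15), each spanning the four
 * float4 columns 4m .. 4m+3 -> 16 full cache lines. */
int b_lines_per_four_iterations(void) {
    return 16 * 4 * FLOAT4_BYTES / LINE_BYTES;
}
```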
After the loop has completed, the work-group additionally has to load and
store to C, which needs access to a 32 × 4 block of C, spanning
32 complete cache lines, meaning that (as long as our L1 cache is at least 32
cache lines large) we will not see any lack of reuse for the elements of C within a
work-group.
If we continue to assume that we only have a single work-group at a time, and
consider the possibilities for cache reuse between consecutively scheduled work-
groups on the same core, we need to consider the state of the L1 cache when
work-group (n, m) finishes execution. The L1 cache contains 256 cache lines in
total, and the operations on C will have filled 32 of those, so 224 remain for A
and B. Each sequence of four iterations needs 48 cache lines, so the number of
iterations that have their cache lines still in cache at work-group completion is
4 · ⌊(256 − 32)/48⌋ = 16, and this lets us see that the work-group size where we
may have reuse between work-groups is when we only need 16 sequences of four
iterations each in the loop, or 64 iterations, which corresponds to a matrix size
of 256 × 256 (as we chose ΔK_reg = 4). For larger matrices, no reuse between
consecutively scheduled work-groups on the same core is possible.
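The bound derived above can be checked numerically. This sketch assumes the 256-line L1, the 32 lines occupied by C, and 48 lines per group of four iterations, as in the text; the helper name is illustrative:

```c
#include <assert.h>

/* Number of loop iterations whose cache lines can still be resident when a
 * work-group finishes: an L1 of l1_lines lines, minus the 32 lines holding
 * C, holds whole groups of four iterations at 48 lines per group. */
int cached_iterations(int l1_lines) {
    int lines_for_ab = l1_lines - 32;      /* lines not occupied by C  */
    int sequences    = lines_for_ab / 48;  /* whole groups of four     */
    return 4 * sequences;
}
```

With the 256-line cache of the text this evaluates to 16 iterations, and 64 loop iterations with ΔK_reg = 4 correspond to K = 256, matching the 256 × 256 matrix size quoted above.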
Arbitrary local work size. For an arbitrary local work size, we have two reasons to
redo the above analysis. First, we have the obvious reason that we get a different
interleaved read pattern between the work-items within a work-group. Second,
we can have more than one work-group simultaneously active on the same core,
if we choose a smaller local work size.
With a local work size of (λ0, λ1), we need to look at all work-groups that are
running simultaneously on the same core. If the total number of work-items on the