Optimizing OpenCL Kernels for the ARM Mali-T600 GPUs - GPU Pro: Advanced Rendering Techniques

Graphics Reference

In-Depth Information

we can qualitatively say that the longer the program runs, the stronger divergence

we see. This implies that the benefits of a low L1 cache utilization fraction have

a stronger impact on performance for large matrix sizes.

The innermost index. Now that we have discussed cache usage, we will return to

one aspect of the NN variant that we did not discuss before, namely the choice

to use j as the innermost index in the sequencing of operations. We have chosen

the order by using global_id(0) for the index j and global_id(1) for the index i

in our CL C kernels. The argument we will provide here holds only for the NN

versions, and is not applicable for the NT versions.

If we again consider the memory read sequence for the scalar NN variant, and

only look at a single iteration and a single work-group, we see

λ 1 − 1

A [ i,k ] , λ 1 − 1

B [ k,j ]

i =0

j =0

i =0

j =0

λ 0 − 1

= λ 1 − 1

, λ 0 − 1

B [ k,j ]

i =0

j =0

A [ i,k ]

λ 0

λ 1 ,

but we could instead have swapped the roles of the global_id s and gotten

λ 1 − 1

A [ i,k ] , λ 1 − 1

B [ k,j ]

j =0

i =0

j =0

i =0

λ 0 − 1

A [ i,k ]

λ 1 , λ 1 − 1

i =0

j =0

B [ k,j ]

λ 0

We will now look for the number of cache-line switches in the execution of

those reads. For the former case, the reads from A are described by

λ 1 − 1

i =0

λ 0 , which switches cache line λ 1 times (and between those switches, we read

from the same matrix element λ 0 times). For different values of i we read from

different rows of the matrix, and separate rows are always on separate cache

lines. 20 Staying with the first version, but looking at B ,weseethatweswitch

matrix element λ 0 λ 1 times, as every read is from a different element than the

previous read, and we perform λ 0 λ 1 reads. We note, however, that we switch

between consecutive elements in memory, which means that we only have λ 0 λ 1 / 16

cache line switches, as we can fit 16 matrix elements into a cache line. For the

second version, with the indices in the other order, analogous considerations

show that we switch matrix element λ 0 λ 1 times for A and λ 1 times for B , but

A [ i,k ]

20 Except for very small matrices, of course, but those cases do not require cache analysis

anyway.

GPU Pro: Advanced Rendering Techniques

Search WWH ::

Custom Search

Home