core is 128 (which seemed optimal in the Sobel study), and if $\lambda_0\lambda_1 = 128$, then we have only a single work-group on the core, but we could have chosen $\lambda_0 = \lambda_1 = 4$, which would give us $128/16 = 8$ work-groups executing simultaneously on a core.
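To make this arithmetic concrete, the number of simultaneously active work-groups per core follows directly from the 128 work-items per core; a minimal Python sketch (the function name is ours, not from the text):

    # Work-groups that fit on one core, assuming the 128 simultaneous
    # work-items per core from the Sobel study above.
    WORK_ITEMS_PER_CORE = 128

    def work_groups_per_core(lam0, lam1):
        """Number of (lam0 x lam1) work-groups active at once on a core."""
        assert WORK_ITEMS_PER_CORE % (lam0 * lam1) == 0
        return WORK_ITEMS_PER_CORE // (lam0 * lam1)

    print(work_groups_per_core(128, 1))  # 1: a single work-group fills the core
    print(work_groups_per_core(4, 4))    # 8: the 128/16 = 8 example above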
As before, it will be beneficial to look at the cache usage over four iterations over $k$, and we can easily generalize the results we had before to see that a single work-group reads $\lambda_1$ full cache lines from $A$ and $4\lambda_0$ full cache lines from $B$ for every four iterations (provided that $\lambda_0$ is a multiple of 4).
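These per-work-group read counts are easy to tabulate; a minimal sketch of the bookkeeping (the helper name is ours, the counts are the ones just derived):

    def cache_lines_per_work_group(lam0, lam1):
        """Full cache lines one work-group reads over four iterations in k."""
        assert lam0 % 4 == 0, "the analysis assumes lam0 is a multiple of 4"
        lines_from_A = lam1        # lam1 full lines from A
        lines_from_B = 4 * lam0    # 4*lam0 full lines from B
        return lines_from_A, lines_from_B

    print(cache_lines_per_work_group(4, 4))  # (4, 16)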
If $\lambda_0\lambda_1 \leq 64$, we have more than one work-group executing simultaneously on the core. In this case, the work-groups that are simultaneously active on a core will have consecutive values of $m$ and identical values of $n$. We see that the reads from $A$ read from the same cache lines, so they are reused between the work-groups. We said above that a few work-groups are sent to each core, and we assume that the work-groups that are active at the same time belong to this set, as we would otherwise not have consecutive group IDs $(m, n)$.^18
This means that the 128 work-items executing simultaneously on one core use
$$\lambda_1 + 4\lambda_0 \cdot [\text{number of work-groups}] = \lambda_1 + 4\lambda_0 \cdot \frac{128}{\lambda_0\lambda_1} = \lambda_1 + 512/\lambda_1$$
cache lines from the L1 cache for four consecutive iterations in $k$. As this expression is independent of $\lambda_0$, we can select our $\lambda_0$ freely (as long as it is a multiple of 4), and the only effect we see (from our analysis so far) is that a larger $\lambda_0$ restricts our possible choices for $\lambda_1$.
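The $\lambda_0$-independence can also be checked numerically; a small Python sketch under the same assumptions ($A$-lines shared between the work-groups on a core, $B$-lines private to each work-group), with names of our choosing:

    def l1_lines_per_core(lam0, lam1):
        """Cache lines used on one core per four iterations in k."""
        groups = 128 // (lam0 * lam1)    # work-groups active on the core
        return lam1 + 4 * lam0 * groups  # A-lines shared, B-lines private

    # The total is the same for any valid lam0 at fixed lam1 = 8:
    print([l1_lines_per_core(lam0, 8) for lam0 in (4, 8, 16)])  # [72, 72, 72]
    print(8 + 512 // 8)  # simplified form lam1 + 512/lam1, also 72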
With $\lambda_1 = 1, 2, 4, 8, 16, 32, 64, 128$, we see that we require $513, 258, 132, 72, 48, 48, 72, 132$ cache lines, and with room for 256 lines in the L1 cache of each core, the fraction of L1 we need to use is^19 $2.0, 1.0, 0.52, 0.28, 0.19, 0.19, 0.28, 0.52$. Under our assumptions of a fully associative cache and perfect execution order between work-items, we would expect all options with a value below 1 to have the same performance (disregarding the effects of L2 and RAM).
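These line counts and fractions can be reproduced from the simplified expression; a short Python sketch, assuming the 256-line L1 stated above:

    L1_CAPACITY = 256  # cache lines in each core's L1 (from the text)

    for lam1 in (1, 2, 4, 8, 16, 32, 64, 128):
        lines = lam1 + 512 // lam1  # cache lines per four iterations in k
        print(lam1, lines, round(lines / L1_CAPACITY, 2))
    # lines:     513, 258, 132, 72, 48, 48, 72, 132
    # fractions: 2.0, 1.01, 0.52, 0.28, 0.19, 0.19, 0.28, 0.52
    # (the text rounds 258/256 to 1.0)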
As we know that our assumptions are incorrect, though, we need to discuss what happens when executing on a real GPU. First, due to the design of the cache, we will see cache misses before the cache is 100% filled, i.e., earlier than our analysis above would have predicted. The more complicated aspect of execution is that the work-items that are spawned in the order we describe here do not keep that order. When one work-item is stalled on a cache miss, other work-items may overtake it, so we will have active work-items that are executing different iterations (different values of $k$) at the same time. We refer to this as thread divergence (or work-item divergence), and the fraction of L1 we need is a measure of how robustly we keep good performance in cases of thread divergence. Thread divergence always happens and is difficult to measure and quantify, but
^18 With work-group divergence, i.e., with a few work-items each from many work-groups partially finished on the same core, we might have work-groups with very different group_ids simultaneously active on the same core.
^19 The numbers are also shown in Table 7.2.