for them, and we find 1.25 and 1, respectively, showing that the second blocking
needs fewer load operations. We see that the added flexibility to choose blocking widths in the NT version can help us decrease the number of memory operations, which is one reason to expect the NT variants to perform better on the GPU.
In both cases, however, we perform O(N³) load operations from data of size O(N²), which means that each datum is reloaded into registers O(N) times. Effective cache usage will clearly be important, as it allows us to access data from caches rather than from main memory.
7.5.6 L1 Cache Analysis of the 1 × 4 × 4 Blocked NN SGEMM
Overview. To estimate the cache usage of our kernels, we have to take into account
the order of reads within a work-item, as well as the fact that many work-items
are active at the same time. We will first look at one specific choice of program
and local work size and then try to extend the analysis to be able to compare the
cache usage of different implementations and local work sizes.
In the program we choose to analyze first, the 1 × 4 × 4 blocked NN implementation, every work-item performs five memory operations per iteration in its loop,
and we will assume that the memory operations take place in the same order as
they appear in the program (i.e., the compiler does not change their order), and
that, for a given work-item, memory operations never execute in parallel. Another important restriction we make in our analysis is that we are able to perfectly predict the order in which work-items execute on
the GPU. Finally, this section will only focus on the L1 cache, which allows us
to restrict the analysis to a single core, as each core has its own L1 cache.
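To make the setting concrete, the following is a minimal sketch of what the inner loop of such a 1 × 4 × 4 blocked NN kernel could look like. The argument names, the indexing, and the use of float4 vector loads are assumptions made purely for illustration, not the actual kernel analyzed here, but the sketch exhibits the five memory operations issued in every iteration of the loop.

__kernel void sgemm_nn_1x4x4(__global const float *A,  /* M x K, row-major (assumed)  */
                             __global const float *B,  /* K x N, row-major (assumed)  */
                             __global float       *C,  /* M x N, row-major (assumed)  */
                             const int K,
                             const int N)
{
    const int i = get_global_id(1);      /* the single row of C this work-item computes */
    const int j = get_global_id(0) * 4;  /* the first of its four columns of C          */

    float4 acc = (float4)(0.0f);         /* 1 x 4 block of C accumulated in registers   */

    for (int k = 0; k < K; k += 4) {
        /* Five memory operations per iteration:                                        */
        float4 a  = vload4(0, A + i * K + k);         /* one load from A                */
        float4 b0 = vload4(0, B + (k + 0) * N + j);   /* four loads from B              */
        float4 b1 = vload4(0, B + (k + 1) * N + j);
        float4 b2 = vload4(0, B + (k + 2) * N + j);
        float4 b3 = vload4(0, B + (k + 3) * N + j);

        acc += a.x * b0 + a.y * b1 + a.z * b2 + a.w * b3;
    }

    vstore4(acc, 0, C + i * N + j);      /* single store of the finished 1 x 4 block    */
}

In this sketch the store of C falls outside the loop, so the five memory operations per iteration are all loads, matching the count above.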
Thread order. A single work-item loops over the variable k, and for every value of k, it performs one memory load from A and four memory loads from B. With our assumptions, we know that the work-item¹⁷ will enter the load-store pipeline once
for each of those instructions, in the order they appear in the program source.
We also know that we schedule one work-group of work-items at a time, that
those work-items execute their memory operations in an interleaved fashion one
after the other, and that they always do this in the order they were spawned by
the GPU. We will now see how we can use that knowledge to analyze how much of the L1 data cache we need.
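As an illustration of the kind of bookkeeping this implies, the small host-side C sketch below enumerates, for a single value of k, the order in which the loads of one work-group could reach the load-store pipeline under one reading of these assumptions. The work-group size, its layout along a single row of C, the matrix dimension, and the cache-line size are all assumptions chosen for illustration only.

#include <stdio.h>

#define WG_SIZE    16   /* assumed work-group size, laid out along one row of C */
#define LINE_BYTES 64   /* assumed L1 cache-line size in bytes                  */
#define N          256  /* assumed matrix dimension                             */

/* Byte address of the op-th load (op 0 is the load from A, ops 1..4 are the
 * loads from rows k..k+3 of B) issued by work-item wi, which works on row i
 * of C in the iteration starting at k. Indexing mirrors the sketch above.   */
static long load_address(int op, int wi, int i, int k)
{
    int j = wi * 4;                               /* this work-item's four columns */
    int elem = (op == 0) ? (i * N + k)            /* the load from A               */
                         : ((k + op - 1) * N + j);/* one of the loads from B       */
    return (long)elem * sizeof(float);
}

int main(void)
{
    int i = 0, k = 0;   /* look at the first row and the first loop iteration      */

    /* Under this reading, all work-items sit at the same point in the program
     * and take turns issuing their next memory operation in spawn order, so the
     * pipeline sees every work-item's load from A, then every work-item's first
     * load from B, and so on. Counting the distinct cache lines in this stream
     * estimates how much of the L1 cache one iteration of the work-group touches. */
    for (int op = 0; op < 5; op++)
        for (int wi = 0; wi < WG_SIZE; wi++)
            printf("load %d of work-item %2d -> cache line %ld\n",
                   op, wi, load_address(op, wi, i, k) / LINE_BYTES);
    return 0;
}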
¹⁷ From a hardware point of view, we of course discuss the behavior and order of threads, but
we continue to use the term work-item, remembering that one work-item in OpenCL corresponds
to one thread in the GPU.