Created threads enter the tri-pipe in round-robin order. A core switches
between threads on every cycle: once a thread has executed one instruction, it
waits while all other threads execute one instruction each.[7] Sometimes a thread
stalls on a cache miss and other threads overtake it, changing
the ordering between threads. (We will discuss this aspect further in Section 7.5.)
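The reordering effect can be illustrated with a toy scheduling model. This is a deliberate simplification, not the actual hardware scheduler: the `schedule` function, the one-instruction-per-turn issue, and the assumption that a cache miss stalls a thread for exactly one round are all illustrative choices.

```python
def schedule(threads, stalls):
    """Issue order under round-robin thread switching.

    threads: dict mapping thread name -> number of instructions to run.
    stalls: set of (thread, instruction_index) pairs whose memory access
    misses in cache, stalling that thread for the following round.
    Returns a list of (thread, instruction_index) in issue order.
    """
    remaining = dict(threads)
    stalled = set()            # threads skipped in the current round
    order = []
    while any(remaining.values()):
        next_stalled = set()
        for t in threads:      # fixed round-robin visiting order
            if remaining[t] == 0 or t in stalled:
                continue       # finished, or waiting on a miss: overtaken
            idx = threads[t] - remaining[t]
            if (t, idx) in stalls:
                next_stalled.add(t)  # the miss stalls the next round
            order.append((t, idx))
            remaining[t] -= 1
        stalled = next_stalled
    return order
```

With no stalls the issue order is strict round-robin; a single miss on one thread's first instruction lets the other threads' later instructions overtake it, which is the ordering change described above.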
7.3.4 Guidelines for Optimizing Performance
A compute program (kernel) typically consists of a mix of A and LS instruction
words.[8] Achieving high performance on the Mali-T604 involves the following:
- Using a sufficient number of active threads to hide the execution latency of
  instructions (pipeline depth). The number of active threads depends on the
  number of registers used by kernel code and so may be limited for complex
  kernels.
- Using vector operations in kernel code to allow for straightforward mapping
  to vector instructions by the compiler.
- Having sufficient instruction-level parallelism in kernel code to allow for
  dense packing of instructions into instruction words by the compiler.
- Having a balance between A and LS instruction words. Without cache
  misses, a 2:1 ratio of A-words to LS-words would be optimal; with
  cache misses, a higher ratio is desirable. For example, a kernel consisting
  of 15 A-words and 7 LS-words is still likely to be bound by the LS-pipe.
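The last guideline can be made concrete with a back-of-the-envelope model. The function below, its name, and the per-LS-word cost parameter are illustrative assumptions based on the 2:1 guideline above, not hardware-documented costs:

```python
def bound_pipe(a_words, ls_words, ls_cost_per_word=2.0):
    """Return which pipe limits throughput under a simple cost model.

    ls_cost_per_word: effective cost of one LS-word in A-word units.
    The default of 2.0 encodes the ideal 2:1 A:LS ratio; values above
    2.0 model the extra latency introduced by cache misses.
    """
    a_time = float(a_words)                 # A-pipe work, in A-word units
    ls_time = ls_words * ls_cost_per_word   # LS-pipe work, same units
    return "A" if a_time > ls_time else "LS"
```

Under this model, the 15:7 kernel from the text is only marginally A-bound with a perfect cache (15 versus 14 units), so even a small miss penalty tips it over to being LS-bound, which is why the text calls it likely LS-pipe bound.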
In several respects, programming for the Mali-T604 GPU embedded on a
System-on-Chip (SoC) is easier than programming for desktop-class GPUs:
- The global and local OpenCL address spaces are mapped to the same
  physical memory (the system RAM), backed by caches that are transparent to
  the programmer. This often removes the need for explicit data copying and
  the associated barrier synchronization.
- All threads have individual program counters, so branch
  divergence is less of an issue than on warp-based architectures.
[7] There is more parallelism in the hardware than this sentence suggests, but the description
here suffices for the current discussion.
[8] The texture (T) pipeline is rarely used for compute kernels, with the notable exception of
executing barrier operations (see Section 7.5). The main reason is that, for kernels requiring
no sampling, performing memory accesses with vector instructions in the LS pipeline yields
higher memory bandwidth (bytes per cycle) than using instructions in the T pipeline.