Created threads enter the tri-pipe in round-robin order. A core switches
between threads on every cycle: once a thread has executed one instruction, it
waits while all other threads execute one instruction each.[7] Sometimes a thread
stalls on a cache miss and other threads overtake it, changing
the ordering between threads. (We will discuss this aspect further in Section 7.5.)
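The reordering effect can be illustrated with a toy scheduling model. This is a deliberate simplification, not the actual hardware scheduler: the `schedule` function, the one-instruction-per-turn issue, and the assumption that a cache miss stalls a thread for exactly one round are all illustrative choices.

```python
def schedule(threads, stalls):
    """Issue order under round-robin thread switching.

    threads: dict mapping thread name -> number of instructions to run.
    stalls: set of (thread, instruction_index) pairs whose memory access
    misses in cache, stalling that thread for the following round.
    Returns a list of (thread, instruction_index) in issue order.
    """
    remaining = dict(threads)
    stalled = set()            # threads skipped in the current round
    order = []
    while any(remaining.values()):
        next_stalled = set()
        for t in threads:      # fixed round-robin visiting order
            if remaining[t] == 0 or t in stalled:
                continue       # finished, or waiting on a miss: overtaken
            idx = threads[t] - remaining[t]
            if (t, idx) in stalls:
                next_stalled.add(t)  # the miss stalls the next round
            order.append((t, idx))
            remaining[t] -= 1
        stalled = next_stalled
    return order
```

With no stalls the issue order is strict round-robin; a single miss on one thread's first instruction lets the other threads' later instructions overtake it, which is the ordering change described above.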
7.3.4 Guidelines for Optimizing Performance
A compute program (kernel) typically consists of a mix of A and LS instruction
words.[8] Achieving high performance on the Mali-T604 involves the following:
- Using a sufficient number of active threads to hide the execution latency of
  instructions (pipeline depth). The number of active threads depends on the
  number of registers used by kernel code and so may be limited for complex
  kernels.
- Using vector operations in kernel code to allow for straightforward mapping
  to vector instructions by the compiler.
- Having sufficient instruction-level parallelism in kernel code to allow for
  dense packing of instructions into instruction words by the compiler.
- Having a balance between A and LS instruction words. Without cache
  misses, a 2:1 ratio of A-words to LS-words would be optimal; with
  cache misses, a higher ratio is desirable. For example, a kernel consisting
  of 15 A-words and 7 LS-words is still likely to be bound by the LS-pipe.
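The last guideline can be made concrete with a back-of-the-envelope model. The function below, its name, and the per-LS-word cost parameter are illustrative assumptions based on the 2:1 guideline above, not hardware-documented costs:

```python
def bound_pipe(a_words, ls_words, ls_cost_per_word=2.0):
    """Return which pipe limits throughput under a simple cost model.

    ls_cost_per_word: effective cost of one LS-word in A-word units.
    The default of 2.0 encodes the ideal 2:1 A:LS ratio; values above
    2.0 model the extra latency introduced by cache misses.
    """
    a_time = float(a_words)                 # A-pipe work, in A-word units
    ls_time = ls_words * ls_cost_per_word   # LS-pipe work, same units
    return "A" if a_time > ls_time else "LS"
```

Under this model, the 15:7 kernel from the text is only marginally A-bound with a perfect cache (15 versus 14 units), so even a small miss penalty tips it over to being LS-bound, which is why the text calls it likely LS-pipe bound.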
In several respects, programming for the Mali-T604 GPU embedded on a
System-on-Chip (SoC) is easier than programming for desktop-class GPUs:
- The global and local OpenCL address spaces are mapped to the same
  physical memory (the system RAM), backed by caches that are transparent to
  the programmer. This often removes the need for explicit data copying and
  the associated barrier synchronization.
- All threads have individual program counters, so branch
  divergence is less of an issue than on warp-based architectures.
[7] There is more parallelism in the hardware than this sentence suggests, but the description
here suffices for the current discussion.
[8] The texture (T) pipeline is rarely used for compute kernels, with the notable exception of
executing barrier operations (see Section 7.5). The main reason is that, for kernels requiring
no sampling, performing memory accesses with vector instructions in the LS pipeline yields
higher memory bandwidth (bytes per cycle) than using instructions in the T pipeline.