Optimizing OpenCL Kernels for the ARM Mali-T600 GPUs - GPU Pro: Advanced Rendering Techniques


7.3.1 Architecture Overview

The Mali-T604 GPU is formed of four identical cores, each supporting up to 256 concurrently executing (active) threads.[4]

Each core contains a tri-pipe consisting of two arithmetic (A) pipelines, one load-store (LS) pipeline, and one texture (T) pipeline. Thus, the peak throughput of each core is two A instruction words, one LS instruction word, and one T instruction word per cycle. Midgard, the architecture underlying the Mali-T600 series, is a VLIW (Very Long Instruction Word) architecture: each pipe contains multiple units, and most instruction words contain instructions for multiple units. In addition, Midgard is a SIMD (Single Instruction Multiple Data) architecture: most instructions operate on multiple data elements packed in 128-bit vector registers.
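The figures above lend themselves to a quick back-of-the-envelope calculation. The following is a minimal sketch, not a hardware model: it assumes that whole-GPU peak throughput is simply the per-core peak multiplied by the number of cores, and the function names are hypothetical, chosen only for illustration.

```python
# Figures taken from the text; the helper names are illustrative only.
NUM_CORES = 4                                # Mali-T604: four identical cores
PIPES_PER_CORE = {"A": 2, "LS": 1, "T": 1}   # tri-pipe layout of one core
VECTOR_REGISTER_BITS = 128                   # width of one vector register


def peak_words_per_cycle(num_cores=NUM_CORES):
    """Peak instruction words per cycle for each pipeline type, over all cores.

    Assumption: every core issues at its peak rate in the same cycle.
    """
    return {pipe: count * num_cores for pipe, count in PIPES_PER_CORE.items()}


def lanes_per_register(element_bits):
    """Number of elements of the given bit width packed in one vector register."""
    if element_bits <= 0 or VECTOR_REGISTER_BITS % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return VECTOR_REGISTER_BITS // element_bits


print(peak_words_per_cycle())   # {'A': 8, 'LS': 4, 'T': 4}
print(lanes_per_register(32))   # 4  (e.g., 32-bit floats, as in float4)
print(lanes_per_register(8))    # 16 (e.g., 8-bit chars, as in char16)
```

Under these assumptions, a kernel that operates on `float4` or `char16` values fills a whole 128-bit register with each SIMD instruction, whereas scalar code uses only a fraction of it.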

7.3.2 Execution Constraints

The architectural maximum number of work-items active on a single core is max(I) = 256. The actual maximum number of active work-items I is determined by the number of registers R that the kernel code uses:

$$
I =
\begin{cases}
256, & 0 < R \le 4, \\
128, & 4 < R \le 8, \\
64,  & 8 < R \le 16.
\end{cases}
$$

For example, kernel A, which uses R_A = 5 registers, and kernel B, which uses R_B = 8 registers, can each be executed by no more than 128 active work-items per core.[5]
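The register rule above can be written as a small lookup. This is a minimal sketch: the function name is hypothetical, and raising an error for register counts outside 1 to 16 is an assumption, since the text does not say what happens beyond that range.

```python
def max_active_work_items(registers):
    """Maximum number of active work-items per core for a kernel using
    `registers` registers, following the piecewise rule in the text."""
    if not 0 < registers <= 16:
        # Assumption: the text only covers 0 < R <= 16.
        raise ValueError("register count must be between 1 and 16")
    if registers <= 4:
        return 256
    if registers <= 8:
        return 128
    return 64


# Kernels A (R_A = 5) and B (R_B = 8) from the example share the same limit.
print(max_active_work_items(5))   # 128
print(max_active_work_items(8))   # 128
# One more register than kernel B halves the limit.
print(max_active_work_items(9))   # 64
```

The steps at R = 4 and R = 8 are the points where using one extra register halves the number of work-items that can be active on a core.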

7.3.3 Thread Scheduling

The GPU schedules work-groups onto cores in batches, whose size is chosen by the driver depending on the characteristics of the job. The hardware schedules batches onto cores in a round-robin fashion. A batch consists of a number of “adjacent” work-groups.[6]

Each core first creates threads for the first scheduled work-group and then continues to create threads for the other scheduled work-groups until either the maximum number of active threads has been reached or all threads for the scheduled work-groups have been created. When a thread terminates, a new thread can be scheduled in its place.
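The scheduling behavior described above can be illustrated with a toy model. This is a sketch under stated assumptions rather than a description of the actual hardware or driver: it assumes that round-robin distribution starts at core 0, that one hardware thread executes one work-item, and that all work-groups have the same size. Both function names are hypothetical.

```python
def assign_batches_round_robin(num_batches, num_cores=4):
    """Return, for each core, the list of batch indices it receives.

    Assumption: distribution starts at core 0 and wraps around.
    """
    cores = [[] for _ in range(num_cores)]
    for batch in range(num_batches):
        cores[batch % num_cores].append(batch)
    return cores


def initially_created_threads(work_group_size, scheduled_work_groups, max_active):
    """Threads a core creates before it has to wait for one to terminate.

    The core stops either at the active-thread limit or once every thread
    of the scheduled work-groups exists, whichever comes first.
    """
    return min(max_active, work_group_size * scheduled_work_groups)


print(assign_batches_round_robin(6))             # [[0, 4], [1, 5], [2], [3]]
print(initially_created_threads(64, 4, 128))     # 128: limit reached first
print(initially_created_threads(64, 1, 256))     # 64: all threads created
```

In the second call, four work-groups of 64 work-items are scheduled on a core whose limit is 128 active threads, so only the first two work-groups are fully resident until some of their threads terminate.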

[4] In what follows, we assume that a single hardware thread executes a single work-item. A program transformation known as thread coarsening can result in a single hardware thread executing multiple work-items, e.g., in different vector lanes.

[5] Therefore, the compiler may prefer to spill a value to memory rather than use an extra register when the number of used registers approaches 4, 8, or 16.

[6] In our examples using 2D ND-ranges, two “adjacent” work-groups have work-items that are adjacent in the 2D space of global IDs (see Section 7.5.6 for a more detailed description).
