7.3.1 Architecture Overview
The Mali-T604 GPU is formed of four identical cores, each supporting up to 256
concurrently executing (active) threads.[4]
Each core contains a tri-pipe comprising two arithmetic (A) pipelines, one
load-store (LS) pipeline, and one texture (T) pipeline. Thus, the peak throughput
of each core is two A instruction words, one LS instruction word, and one T
instruction word per cycle. Midgard is a VLIW (Very Long Instruction Word)
architecture, so that each pipe contains multiple units and most instruction words
contain instructions for multiple units. In addition, Midgard is a SIMD (Single
Instruction Multiple Data) architecture, so that most instructions operate on
multiple data elements packed in 128-bit vector registers.
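
The figures above can be turned into a quick back-of-the-envelope calculation. The following is a minimal sketch, not a performance model: the constant and function names are hypothetical, and it assumes that vector elements are packed into a register without padding. It counts instruction words, not individual instructions, since each word may carry instructions for several units.

```python
# Figures taken from the text; all names here are illustrative only.
NUM_CORES = 4
PIPES_PER_CORE = {"A": 2, "LS": 1, "T": 1}  # arithmetic, load-store, texture
REGISTER_BITS = 128


def peak_words_per_cycle(num_cores=NUM_CORES):
    """Peak instruction words per cycle, per pipeline type, across the GPU."""
    return {pipe: count * num_cores for pipe, count in PIPES_PER_CORE.items()}


def simd_lanes(element_bits):
    """Elements of the given width that fit in one vector register
    (assumes tight packing, which is an assumption of this sketch)."""
    return REGISTER_BITS // element_bits


print(peak_words_per_cycle())  # {'A': 8, 'LS': 4, 'T': 4}
print(simd_lanes(32))          # 4, e.g., four 32-bit floats
```

Under these assumptions, a 128-bit register holds four 32-bit, eight 16-bit, or sixteen 8-bit elements.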
7.3.2 Execution Constraints
The architectural maximum number of work-items active on a single core is
max(I) = 256. The actual maximum number of active work-items I is determined
by the number of registers R that the kernel code uses:

\[
I =
\begin{cases}
256, & 0 < R \le 4, \\
128, & 4 < R \le 8, \\
64,  & 8 < R \le 16.
\end{cases}
\]
For example, kernel A that uses R_A = 5 registers and kernel B that uses
R_B = 8 registers can both be executed by no more than 128 work-items.[5]
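
The register constraint can be written as a small lookup. This is only a sketch of the relationship stated above: the function name is hypothetical, it assumes one hardware thread per work-item, and raising an error for register counts outside 1 to 16 is an assumption, since the text does not say what happens in that case.

```python
def max_active_work_items(registers):
    """Maximum active work-items per core for a kernel using `registers` registers."""
    if not 0 < registers <= 16:
        # Behavior outside this range is not described in the text.
        raise ValueError("register count must be in the range 1..16")
    if registers <= 4:
        return 256
    if registers <= 8:
        return 128
    return 64


print(max_active_work_items(5))  # 128 (kernel A)
print(max_active_work_items(8))  # 128 (kernel B)
```

Going from 8 to 9 registers halves the limit from 128 to 64, which is why the thresholds at 4, 8, and 16 matter.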
7.3.3 Thread Scheduling
The GPU schedules work-groups onto cores in batches, whose size is chosen by
the driver depending on the characteristics of the job. The hardware schedules
batches onto cores in a round-robin fashion. A batch consists of a number of
“adjacent” work-groups.[6]
Each core first creates threads for the first scheduled work-group and then
continues to create threads for the other scheduled work-groups until either the
maximum number of active threads has been reached or all threads for the scheduled
work-groups have been created. When a thread terminates, a new thread
can be scheduled in its place.
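
The scheduling behavior described above can be illustrated with a toy model. This is a simplified sketch under several assumptions, not a description of the actual hardware or driver: the function names are hypothetical, the batch size is passed in as a parameter because the driver's choice is not specified, adjacency is modeled as consecutive linear work-group IDs, and round-robin is modeled as static dealing of batches to cores.

```python
def assign_batches(num_work_groups, batch_size, num_cores=4):
    """Split work-groups into batches of consecutive IDs and deal the
    batches to cores in round-robin order."""
    per_core = [[] for _ in range(num_cores)]
    starts = range(0, num_work_groups, batch_size)
    for batch_index, start in enumerate(starts):
        end = min(start + batch_size, num_work_groups)
        per_core[batch_index % num_cores].extend(range(start, end))
    return per_core


def initial_threads(scheduled_groups, work_group_size, max_active=256):
    """Threads created per work-group, in scheduling order, until the
    core's active-thread limit is reached."""
    created = {}
    free = max_active
    for group in scheduled_groups:
        if free == 0:
            break
        count = min(work_group_size, free)
        created[group] = count
        free -= count
    return created


cores = assign_batches(num_work_groups=10, batch_size=2)
print(cores)                          # [[0, 1, 8, 9], [2, 3], [4, 5], [6, 7]]
print(initial_threads(cores[0], 96))  # {0: 96, 1: 96, 8: 64}
```

In this example, the first core starts all 96 threads of work-groups 0 and 1 but only 64 threads of work-group 8. In the model, the remaining threads would be created as earlier threads terminate and free their slots.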
[4] In what follows, we assume that a single hardware thread executes a single work-item.
A program transformation known as thread coarsening can result in a single hardware thread
executing multiple work-items, e.g., in different vector lanes.
[5] Therefore, the compiler may prefer to spill a value to memory rather than use an extra
register when the number of used registers approaches 4, 8, or 16.
[6] In our examples using 2D ND-ranges, two “adjacent” work-groups have work-items that
are adjacent in the 2D space of global IDs (see Section 7.5.6 for a more detailed description).