7.3.1 Architecture Overview
The Mali-T604 GPU consists of four identical cores, each supporting up to 256
concurrently executing (active) threads. 4
Each core contains a tri-pipe consisting of two arithmetic (A) pipelines, one
load-store (LS) pipeline, and one texture (T) pipeline. Thus, the peak throughput
of each core is two A instruction words, one LS instruction word, and one T
instruction word per cycle. Midgard is a VLIW (Very Long Instruction Word)
architecture: each pipe contains multiple units, and most instruction words
contain instructions for multiple units. Midgard is also a SIMD (Single
Instruction Multiple Data) architecture: most instructions operate on
multiple data elements packed in 128-bit vector registers.
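The figures above can be combined into a few back-of-the-envelope bounds. The following is a minimal sketch, not an ARM API: all names are illustrative, and the whole-GPU numbers assume that every core issues at its peak rate in the same cycle, so they are upper bounds only. The lane count simply divides the 128-bit register width by an element width; the text does not say which element widths each instruction supports.

```python
# Figures taken from the text; the names are illustrative assumptions.
CORES = 4
MAX_THREADS_PER_CORE = 256
VECTOR_REGISTER_BITS = 128
# Peak instruction words issued per cycle by one core, per pipeline type.
WORDS_PER_CYCLE_PER_CORE = {"A": 2, "LS": 1, "T": 1}


def peak_words_per_cycle(cores=CORES):
    """Upper bound on instruction words per cycle across `cores` cores."""
    return {pipe: n * cores for pipe, n in WORDS_PER_CYCLE_PER_CORE.items()}


def simd_lanes(element_bits):
    """Number of elements of the given width that fit in one vector register."""
    return VECTOR_REGISTER_BITS // element_bits


def max_active_threads(cores=CORES):
    """Architectural maximum of concurrently active threads on the GPU."""
    return cores * MAX_THREADS_PER_CORE
```

For instance, `simd_lanes(32)` gives the number of 32-bit elements that one vector register can hold, and `max_active_threads()` gives the architectural limit for the whole GPU.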
7.3.2 Execution Constraints
The architectural maximum number of work-items active on a single core is
max(I) = 256. The actual maximum number of active work-items I is determined
by the number of registers R that the kernel code uses:

\[
I =
\begin{cases}
256, & 0 < R \le 4, \\
128, & 4 < R \le 8, \\
64,  & 8 < R \le 16.
\end{cases}
\]
For example, kernel A that uses R_A = 5 registers and kernel B that uses
R_B = 8 registers can each have no more than 128 work-items active on a core. 5
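The register rule can be written as a small lookup. This is a minimal sketch under the assumption that R is a whole number of registers between 1 and 16; the function name is hypothetical, and the text does not say what happens outside that range, so the sketch rejects such values rather than guessing.

```python
def max_active_work_items(registers):
    """Maximum active work-items per core for a kernel using `registers` registers."""
    if not 0 < registers <= 16:
        # The text defines I only for 0 < R <= 16.
        raise ValueError("register count must satisfy 0 < R <= 16")
    if registers <= 4:
        return 256
    if registers <= 8:
        return 128
    return 64
```

Calling `max_active_work_items(5)` and `max_active_work_items(8)` reproduces the two kernels in the example: both fall into the 4 < R <= 8 band.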
7.3.3 Thread Scheduling
The GPU schedules work-groups onto cores in batches, whose size is chosen by
the driver depending on the characteristics of the job. The hardware schedules
batches onto cores in a round-robin fashion. A batch consists of a number of
“adjacent” work-groups. 6
Each core first creates threads for the first scheduled work-group and then
continues to create threads for the other scheduled work-groups until either the
maximum number of active threads has been reached or all threads for the sched-
uled work-groups have been created. When a thread terminates, a new thread
can be scheduled in its place.
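The scheduling behaviour described above can be illustrated with a toy model. This is a simplified sketch, not the actual hardware or driver logic: the function names are hypothetical, the batch size is a plain parameter (in reality the driver chooses it), consecutive linear work-group indices stand in for "adjacent" work-groups, and all batches are dealt out up front.

```python
def assign_batches(num_work_groups, batch_size, num_cores=4):
    """Deal consecutive batches of work-group indices to cores round-robin."""
    per_core = [[] for _ in range(num_cores)]
    starts = range(0, num_work_groups, batch_size)
    for batch_index, start in enumerate(starts):
        end = min(start + batch_size, num_work_groups)
        per_core[batch_index % num_cores].extend(range(start, end))
    return per_core


def initial_threads(work_group_sizes, max_active=256):
    """Threads created for each scheduled work-group before the core fills up.

    Threads are created for the work-groups in order until either the
    limit on active threads is reached or every thread has been created.
    """
    created = []
    free = max_active
    for size in work_group_sizes:
        n = min(size, free)
        created.append(n)
        free -= n
    return created
```

In this model, a core that receives three work-groups of 128 work-items with a limit of 256 active threads starts all threads of the first two work-groups and none of the third; threads of the third are created only as earlier threads terminate.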
4 In what follows, we assume that a single hardware thread executes a single work-item.
A program transformation known as thread coarsening can result in a single hardware thread
executing multiple work-items, e.g., in different vector lanes.
5 Therefore, the compiler may prefer to spill a value to memory rather than use an extra
register when the number of used registers approaches 4, 8, or 16.
6 In our examples using 2D ND-ranges, two “adjacent” work-groups have work-items that
are adjacent in the 2D space of global IDs (see Section 7.5.6 for a more detailed description).