which means that on each iteration 0 ≤ k < N all work-items first read from A and then from B. The loop over j, 0 ≤ j < λ₀, is the innermost one, due to the order in which threads are created by the device.¹²
7.5.5 Blocking
As a first step towards better implementations, we will introduce blocking, a
program transformation that will allow us to use vector operations for memory
accesses and arithmetic operations. We will later exploit blocking to improve
cache usage.
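In general terms, blocking rewrites a loop nest so that it traverses its iteration space in small fixed-size tiles rather than element by element. A minimal sketch of the transformation (the names DI, DJ, and f are placeholders, and N is assumed divisible by both factors):

    /* Plain loop nest over an N x N iteration space: */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            f(i, j);

    /* The same iteration space, blocked into DI x DJ tiles; the fixed-size
       tile bodies are what we will later map onto vector registers: */
    for (int I = 0; I < N; I += DI)
        for (int J = 0; J < N; J += DJ)
            for (int i = I; i < I + DI; i++)   /* within one tile */
                for (int j = J; j < J + DJ; j++)
                    f(i, j);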
Let us assume that the matrix order N is divisible by the blocking factors Δ_I, Δ_J, and Δ_K^reg. Imagine that

- matrix A consists of (N/Δ_I) × (N/Δ_K^reg) submatrices of Δ_I × Δ_K^reg elements each,
- matrix B consists of (N/Δ_K^reg) × (N/Δ_J) submatrices of Δ_K^reg × Δ_J elements each,
- matrix C consists of (N/Δ_I) × (N/Δ_J) submatrices of Δ_I × Δ_J elements each.
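For example, with N = 16 and Δ_I = Δ_J = Δ_K^reg = 4, each of A, B, and C decomposes into 4 × 4 = 16 submatrices of 4 × 4 elements each.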
Now, instead of writing the SGEMM NN algorithm as
    C[i,j] = α ∑_k A[i,k] × B[k,j] + β C[i,j],

where A[i,k], B[k,j], and C[i,j] were individual elements, we can write it as
    C[I,J] = α ∑_K A[I,K] × B[K,J] + β C[I,J],

where A[I,K], B[K,J], and C[I,J] are submatrices as above and × is the matrix multiplication operation.
Our kernels will still have the same structure, but each work-item will now compute one Δ_I × Δ_J submatrix C[I,J]. Each iteration of the k loop will multiply a Δ_I × Δ_K^reg submatrix of A by a Δ_K^reg × Δ_J submatrix of B, accumulating into the Δ_I × Δ_J matrix.
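In outline, such a kernel might look like the following sketch. The kernel name, the argument order, the row-major indexing, and the fixed blocking factors Δ_I = Δ_J = Δ_K^reg = 4 are all assumptions made for concreteness; the inner block product is written with scalar loops here, and its vectorized form is the subject of the next snippet.

    __kernel void sgemm_nn_blocked(int N, float alpha, float beta,
                                   __global const float *A,
                                   __global const float *B,
                                   __global float *C)
    {
        const int j = get_global_id(0) * 4; /* first column of this C block */
        const int i = get_global_id(1) * 4; /* first row of this C block    */

        float acc[4][4] = {{0.0f}};         /* Delta_I x Delta_J accumulator */

        /* Each iteration multiplies a 4 x 4 block of A by a 4 x 4 block of B. */
        for (int k = 0; k < N; k += 4)
            for (int ii = 0; ii < 4; ii++)
                for (int kk = 0; kk < 4; kk++)
                    for (int jj = 0; jj < 4; jj++)
                        acc[ii][jj] += A[(i + ii) * N + (k + kk)]
                                     * B[(k + kk) * N + (j + jj)];

        /* C = alpha * A * B + beta * C for this work-item's 4 x 4 block. */
        for (int ii = 0; ii < 4; ii++)
            for (int jj = 0; jj < 4; jj++)
                C[(i + ii) * N + (j + jj)] =
                    alpha * acc[ii][jj] + beta * C[(i + ii) * N + (j + jj)];
    }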
For the NN variant, where both A and B use normal matrix layout, we implement the matrix multiplication between a 1 × Δ_K^reg matrix and a Δ_K^reg × 4 matrix, that is, between the 1 × 4 block from A and the 4 × 4 block from B, as follows.¹³
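A minimal sketch of this step, under the same assumptions as above. Here k is the element index of the current block and steps by 4, so it is used directly as the offset (with a block-counting loop the offset into B would be 4k, cf. footnote 13), and c is the float4 accumulator holding the current 1 × 4 row of the work-item's C block:

    float4 a  = vload4(0, A + i * N + k);        /* 1 x 4 block from A     */
    float4 b0 = vload4(0, B + (k + 0) * N + j);  /* the four rows of the   */
    float4 b1 = vload4(0, B + (k + 1) * N + j);  /* 4 x 4 block from B     */
    float4 b2 = vload4(0, B + (k + 2) * N + j);
    float4 b3 = vload4(0, B + (k + 3) * N + j);
    c += a.x * b0 + a.y * b1 + a.z * b2 + a.w * b3;  /* accumulate the row */

Each term of the form a.x * b0 scales a whole row of the B block by one element of the A block, so the four multiply-adds together realize the 1 × Δ_K^reg by Δ_K^reg × 4 product with vector operations.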
¹² The GPU increments the ID associated with λ₀ as the innermost index, and by assigning get_global_id(0) to j, we have chosen j as the innermost index in our implementation. This choice will be discussed in Section 7.5.6.
¹³ In code snippets such as these, we gloss over some details, such as the fact that we need 4k instead of k as the offset in the reads from B.