which means that on each iteration 0 ≤ k < N all work-items first read from A and then from B. The loop over j, 0 ≤ j < λ₀, is the innermost one, due to the order in which threads are created by the device.¹²
7.5.5 Blocking
As a first step towards better implementations, we will introduce blocking, a
program transformation that will allow us to use vector operations for memory
accesses and arithmetic operations. We will later exploit blocking to improve
cache usage.
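In general terms, blocking rewrites a loop nest so that it traverses its iteration space in small fixed-size tiles rather than element by element. A minimal sketch of the transformation (the names DI, DJ, and f are placeholders, and N is assumed divisible by both factors):

    /* Plain loop nest over an N x N iteration space: */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            f(i, j);

    /* The same iteration space, blocked into DI x DJ tiles; the fixed-size
       tile bodies are what we will later map onto vector registers: */
    for (int I = 0; I < N; I += DI)
        for (int J = 0; J < N; J += DJ)
            for (int i = I; i < I + DI; i++)   /* within one tile */
                for (int j = J; j < J + DJ; j++)
                    f(i, j);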
Let us assume that the matrix order N is divisible by the blocking factors Δ_I, Δ_J, and Δ_K^reg. Imagine that

- matrix A consists of (N/Δ_I) × (N/Δ_K^reg) submatrices of Δ_I × Δ_K^reg elements each,
- matrix B consists of (N/Δ_K^reg) × (N/Δ_J) submatrices of Δ_K^reg × Δ_J elements each,
- matrix C consists of (N/Δ_I) × (N/Δ_J) submatrices of Δ_I × Δ_J elements each.
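For example, with N = 16 and Δ_I = Δ_J = Δ_K^reg = 4, each of A, B, and C decomposes into 4 × 4 = 16 submatrices of 4 × 4 elements each.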
Now, instead of writing the SGEMM NN algorithm as
    C[i,j] = α ∑_k A[i,k] × B[k,j] + β C[i,j],

where A[i,k], B[k,j], and C[i,j] were individual elements, we can write it as
    C[I,J] = α ∑_K A[I,K] × B[K,J] + β C[I,J],

where A[I,K], B[K,J], and C[I,J] are submatrices as above and × is the matrix multiplication operation.
Our kernels will still have the same structure, but each work-item will now compute one Δ_I × Δ_J submatrix C[I,J]. Each iteration of the k loop will multiply a Δ_I × Δ_K^reg submatrix of A by a Δ_K^reg × Δ_J submatrix of B, accumulating into the Δ_I × Δ_J matrix.
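In outline, such a kernel might look like the following sketch. The kernel name, the argument order, the row-major indexing, and the fixed blocking factors Δ_I = Δ_J = Δ_K^reg = 4 are all assumptions made for concreteness; the inner block product is written with scalar loops here, and its vectorized form is the subject of the next snippet.

    __kernel void sgemm_nn_blocked(int N, float alpha, float beta,
                                   __global const float *A,
                                   __global const float *B,
                                   __global float *C)
    {
        const int j = get_global_id(0) * 4; /* first column of this C block */
        const int i = get_global_id(1) * 4; /* first row of this C block    */

        float acc[4][4] = {{0.0f}};         /* Delta_I x Delta_J accumulator */

        /* Each iteration multiplies a 4 x 4 block of A by a 4 x 4 block of B. */
        for (int k = 0; k < N; k += 4)
            for (int ii = 0; ii < 4; ii++)
                for (int kk = 0; kk < 4; kk++)
                    for (int jj = 0; jj < 4; jj++)
                        acc[ii][jj] += A[(i + ii) * N + (k + kk)]
                                     * B[(k + kk) * N + (j + jj)];

        /* C = alpha * A * B + beta * C for this work-item's 4 x 4 block. */
        for (int ii = 0; ii < 4; ii++)
            for (int jj = 0; jj < 4; jj++)
                C[(i + ii) * N + (j + jj)] =
                    alpha * acc[ii][jj] + beta * C[(i + ii) * N + (j + jj)];
    }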
For the NN variant, where both A and B use normal matrix layout, we implement the matrix multiplication between a 1 × Δ_K^reg matrix and a Δ_K^reg × 4 matrix, that is, between the 1 × 4 block from A and the 4 × 4 block from B, as follows.¹³
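A minimal sketch of this step, under the same assumptions as above. Here k is the element index of the current block and steps by 4, so it is used directly as the offset (with a block-counting loop the offset into B would be 4k, cf. footnote 13), and c is the float4 accumulator holding the current 1 × 4 row of the work-item's C block:

    float4 a  = vload4(0, A + i * N + k);        /* 1 x 4 block from A     */
    float4 b0 = vload4(0, B + (k + 0) * N + j);  /* the four rows of the   */
    float4 b1 = vload4(0, B + (k + 1) * N + j);  /* 4 x 4 block from B     */
    float4 b2 = vload4(0, B + (k + 2) * N + j);
    float4 b3 = vload4(0, B + (k + 3) * N + j);
    c += a.x * b0 + a.y * b1 + a.z * b2 + a.w * b3;  /* accumulate the row */

Each term of the form a.x * b0 scales a whole row of the B block by one element of the A block, so the four multiply-adds together realize the 1 × Δ_K^reg by Δ_K^reg × 4 product with vector operations.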
¹² The GPU increments the ID associated with λ₀ as the innermost index, and by assigning get_global_id(0) to j, we have chosen j as the innermost index in our implementation. This choice will be discussed in Section 7.5.6.
¹³ In code snippets such as these, we gloss over some details, such as the fact that we need 4k instead of k as the offset in the reads from B.