kernel void sgemm(global float const *A, global float const *B,
                  global float *C, float alpha, float beta, uint n)
{
    uint j = get_global_id(0);
    uint i = get_global_id(1);
    float ABij = 0.0f;
    for (uint k = 0; k < n; ++k)
    {
        ABij += A[i*n + k] * B[j*n + k];
    }
    C[i*n + j] = alpha * ABij + beta * C[i*n + j];
}
Listing 7.10. Initial scalar implementation: scalarNT.
Transposed. Each work-item of the scalarNT version in Listing 7.10 produces one
element of $C$ by computing the dot product of a row of $A$ and a column of $B^T$
(or equivalently a row of $B$).
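As a cross-check of what Listing 7.10 computes, the following is a minimal host-side sketch in plain C that performs the same arithmetic sequentially. The function name sgemm_nt_reference is hypothetical and not part of the original text. The sketch assumes square n x n row-major matrices, as the kernel does, and replaces the two global IDs with ordinary loops.

```c
/* Hypothetical sequential reference for the scalarNT kernel:
   C = alpha * A * B^T + beta * C, all matrices n x n, row-major. */
void sgemm_nt_reference(const float *A, const float *B, float *C,
                        float alpha, float beta, unsigned n)
{
    for (unsigned i = 0; i < n; ++i) {          /* stands in for get_global_id(1) */
        for (unsigned j = 0; j < n; ++j) {      /* stands in for get_global_id(0) */
            float ABij = 0.0f;
            /* Dot product of row i of A with row j of B. */
            for (unsigned k = 0; k < n; ++k) {
                ABij += A[i * n + k] * B[j * n + k];
            }
            C[i * n + j] = alpha * ABij + beta * C[i * n + j];
        }
    }
}
```

Comparing a device result against such a reference, within a small floating-point tolerance, is one simple way to validate an optimized variant.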
7.5.4 Memory Access Patterns of Scalar Implementations
A single work-item of the scalarNN version sequentially reads (within the $k$ loop)
from pairs of locations $(A[i,0], B[0,j]), (A[i,1], B[1,j]), \ldots, (A[i,N-1], B[N-1,j])$.
We will abbreviate this access pattern to
$$\mathop{;}\limits_{k=0}^{N-1} \, (A[i,k], B[k,j]),$$
which denotes that the accesses happen sequentially for $0 \le k < N$.
Similarly, the access pattern of a single work-item of the scalarNT version is
$$\mathop{;}\limits_{k=0}^{N-1} \, (A[i,k], B[j,k]).$$
With the row-major array layout used in the C language, the scalarNT variant
reads both $A$ and $B$ with stride 1, while the scalarNN variant reads $B$ with
stride $N$.
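The stride claim can be checked with a little index arithmetic. The sketch below, in plain C, assumes square n x n row-major matrices. The helper names (rm_offset, stride_a, stride_b_nn, stride_b_nt) are hypothetical and introduced only for illustration. Each stride function returns the distance, in elements, between the locations read in iterations k and k + 1 of the loop.

```c
/* Element offset of M[row, col] in an n x n row-major matrix. */
static unsigned rm_offset(unsigned row, unsigned col, unsigned n)
{
    return row * n + col;
}

/* A[i, k] -> A[i, k + 1]: both variants walk along a row of A. */
unsigned stride_a(unsigned i, unsigned k, unsigned n)
{
    return rm_offset(i, k + 1, n) - rm_offset(i, k, n);
}

/* scalarNN reads B[k, j] -> B[k + 1, j]: it walks down a column of B. */
unsigned stride_b_nn(unsigned j, unsigned k, unsigned n)
{
    return rm_offset(k + 1, j, n) - rm_offset(k, j, n);
}

/* scalarNT reads B[j, k] -> B[j, k + 1]: it walks along a row of B. */
unsigned stride_b_nt(unsigned j, unsigned k, unsigned n)
{
    return rm_offset(j, k + 1, n) - rm_offset(j, k, n);
}
```

For n = 1024, stride_b_nn returns 1024 while the other two functions return 1, so with 4-byte floats consecutive scalarNN reads of B are 4 KiB apart.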
Let us assume a core executes a single work-group of dimensions $(\lambda_0, \lambda_1)$.
Since work-items execute in an interleaved order (Section 7.3.3), the actual memory
access pattern of the scalarNN variant on the core will be
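$$\mathop{;}\limits_{k=0}^{N-1} \left[ \; \mathop{;}\limits_{i=0}^{\lambda_1-1} \, \mathop{;}\limits_{j=0}^{\lambda_0-1} A[i,k], \;\; \mathop{;}\limits_{i=0}^{\lambda_1-1} \, \mathop{;}\limits_{j=0}^{\lambda_0-1} B[k,j] \; \right],$$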