    float4 a  = A[i, k];
    float4 b0 = B[k+0, j];
    float4 b1 = B[k+1, j];
    float4 b2 = B[k+2, j];
    float4 b3 = B[k+3, j];
    ab += a.s0*b0 + a.s1*b1 + a.s2*b2 + a.s3*b3;
where ab (of type float4) is the accumulator for the 1 × 4 block of C and all operations are vector operations. The kernel is shown in Listing 7.11.
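To make the vector arithmetic concrete, the following is a minimal host-side sketch in plain C of one k-step of the 1 × 4 block update. It is not OpenCL code: the vec4 struct, nn_block_step, and nn_block_example are hypothetical names introduced only for illustration, and the struct merely stands in for float4.

```c
#include <assert.h>

/* Plain-C stand-in for OpenCL's float4 (assumption: host-side emulation only). */
typedef struct { float s[4]; } vec4;

/* One k-step of the 1x4 block update:
 * ab += a.s0*b0 + a.s1*b1 + a.s2*b2 + a.s3*b3, applied per component. */
static void nn_block_step(vec4 *ab, vec4 a, const vec4 b[4])
{
    for (int c = 0; c < 4; ++c) {
        float sum = a.s[0] * b[0].s[c] + a.s[1] * b[1].s[c]
                  + a.s[2] * b[2].s[c] + a.s[3] * b[3].s[c];
        ab->s[c] += sum;
    }
}

/* Hypothetical example: multiply the row (1,2,3,4) by a 4x4 block of B
 * that holds the values 1..16 in row-major order. */
static vec4 nn_block_example(void)
{
    vec4 a = {{1.0f, 2.0f, 3.0f, 4.0f}};
    vec4 b[4];
    vec4 ab = {{0.0f, 0.0f, 0.0f, 0.0f}};
    for (int r = 0; r < 4; ++r)
        for (int c = 0; c < 4; ++c)
            b[r].s[c] = (float)(4 * r + c + 1);
    nn_block_step(&ab, a, b);
    assert(ab.s[0] == 90.0f);   /* 1*1 + 2*5 + 3*9 + 4*13 */
    return ab;
}
```

Each scalar component of a scales a whole row of the 4 × 4 block of B, so the four multiply-adds per step are full-width vector operations, matching the statement that all operations are vector operations.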
For the NT variant, we instead select (ΔI = 2, ΔJ = 2, ΔK_reg. = 4) and implement the multiplication between the 2 × 4 block of A and the 2 × 4 block of the transposed B as
    float4 a0 = A[i, k];
    float4 a1 = A[i+1, k];
    float4 b0 = B[j, k];
    float4 b1 = B[j+1, k];
    ab.s01 += (float2)(dot(a0, b0), dot(a0, b1));
    ab.s23 += (float2)(dot(a1, b0), dot(a1, b1));
where ab is an accumulator variable of type float4 for the 2 × 2 block of the matrix C (see footnote 14). The full kernel is shown in Listing 7.12.
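As a rough illustration of the NT update, the sketch below emulates one k-step of the 2 × 2 block in plain C on the host. The names dot4, nt_block_step, and nt_block_example are hypothetical, and float[4] arrays stand in for float4; the accumulator layout (components 0 and 1 for the top row, 2 and 3 for the bottom row) follows the ab.s01 and ab.s23 usage in the fragment above.

```c
#include <assert.h>

/* Plain-C stand-in for OpenCL's dot() on two float4 values. */
static float dot4(const float x[4], const float y[4])
{
    return x[0] * y[0] + x[1] * y[1] + x[2] * y[2] + x[3] * y[3];
}

/* One k-step of the 2x2 block update. a0, a1 are two rows of A;
 * b0, b1 are two rows of the transposed B (i.e. two columns of B). */
static void nt_block_step(float ab[4],
                          const float a0[4], const float a1[4],
                          const float b0[4], const float b1[4])
{
    ab[0] += dot4(a0, b0);   /* ab.s0: C[i,   j]   */
    ab[1] += dot4(a0, b1);   /* ab.s1: C[i,   j+1] */
    ab[2] += dot4(a1, b0);   /* ab.s2: C[i+1, j]   */
    ab[3] += dot4(a1, b1);   /* ab.s3: C[i+1, j+1] */
}

/* Hypothetical example with small integer inputs so results are exact. */
static void nt_block_example(float ab[4])
{
    const float a0[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    const float a1[4] = {5.0f, 6.0f, 7.0f, 8.0f};
    const float b0[4] = {1.0f, 0.0f, 1.0f, 0.0f};
    const float b1[4] = {0.0f, 1.0f, 0.0f, 1.0f};
    for (int c = 0; c < 4; ++c)
        ab[c] = 0.0f;
    nt_block_step(ab, a0, a1, b0, b1);
    assert(ab[0] == 4.0f);   /* 1 + 3 */
}
```

Because B is stored transposed, both operands of every dot product are contiguous along k, which is why this variant can use dot() rather than the scalar-times-vector form of the NN fragment.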
    kernel void sgemm(global float4 const *A,
                      global float4 const *B,
                      global float4 *C,
                      float alpha, float beta, uint n)
    {
        uint j = get_global_id(0);
        uint i = get_global_id(1);
        uint nv4 = n >> 2;
        float4 accum = (float4)0.0f;
        for (uint k = 0; k < nv4; ++k)
        {
            float4 a  = A[i*nv4 + k];
            float4 b0 = B[(4*k+0)*nv4 + j];
            float4 b1 = B[(4*k+1)*nv4 + j];
            float4 b2 = B[(4*k+2)*nv4 + j];
            float4 b3 = B[(4*k+3)*nv4 + j];
            accum += a.s0*b0 + a.s1*b1 + a.s2*b2 + a.s3*b3;
        }
        C[i*nv4 + j] = alpha*accum + beta*C[i*nv4 + j];
    }
Listing 7.11. Vectorized implementation: blockedNN.
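One way to sanity-check the indexing in Listing 7.11 is to emulate it on the host and compare against a naive SGEMM. The plain-C sketch below does that under several assumptions: matrices are square, row-major, with n a multiple of 4; one work-item handles one 1 × 4 block of C, so the emulated NDRange is (n/4) × n, which is inferred from the indexing rather than stated in the listing; and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Emulates one work-item of the blocked NN kernel: row i, float4 column j. */
static void sgemm_nn_work_item(const float *A, const float *B, float *C,
                               float alpha, float beta,
                               size_t n, size_t i, size_t j)
{
    size_t nv4 = n >> 2;
    float accum[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (size_t k = 0; k < nv4; ++k) {
        const float *a = &A[4 * (i * nv4 + k)];            /* float4 A[i*nv4 + k] */
        for (size_t c = 0; c < 4; ++c) {
            accum[c] += a[0] * B[4 * ((4 * k + 0) * nv4 + j) + c]
                      + a[1] * B[4 * ((4 * k + 1) * nv4 + j) + c]
                      + a[2] * B[4 * ((4 * k + 2) * nv4 + j) + c]
                      + a[3] * B[4 * ((4 * k + 3) * nv4 + j) + c];
        }
    }
    for (size_t c = 0; c < 4; ++c) {
        size_t idx = 4 * (i * nv4 + j) + c;                /* float4 C[i*nv4 + j] */
        C[idx] = alpha * accum[c] + beta * C[idx];
    }
}

/* Emulates the whole NDRange: global size (n/4, n). */
static void sgemm_nn_emulated(const float *A, const float *B, float *C,
                              float alpha, float beta, size_t n)
{
    assert(n % 4 == 0);  /* the kernel assumes n is a multiple of 4 */
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n / 4; ++j)
            sgemm_nn_work_item(A, B, C, alpha, beta, n, i, j);
}

/* Naive scalar reference: C = alpha*A*B + beta*C. */
static void sgemm_reference(const float *A, const float *B, float *C,
                            float alpha, float beta, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = alpha * sum + beta * C[i * n + j];
        }
}
```

With small integer inputs the two functions agree exactly; with general floating-point data the summation order differs, so a comparison should allow a small tolerance.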
14. The components ab.s01 accumulate the top row and the components ab.s23 accumulate the bottom row of the 2 × 2 block.