#define di ((uint)2)
#define dj ((uint)2)
#define dk ((uint)32)

kernel void
sgemm(global float4 const *A,
      global float4 const *B,
      global float2 *C,
      float alpha,
      float beta,
      uint n)
{
    uint j = get_global_id(0);
    uint i = get_global_id(1);
    uint nv4 = n >> 2;              /* row length in float4 elements */
    float4 ab = (float4)0.0f;       /* 2x2 block of partial results */

    for (uint k = 0; k < nv4; k += dk)
    {
        for (uint kk = 0; kk < dk; ++kk)
        {
            float4 a0 = A[2*i*nv4 + kk + k];
            float4 a1 = A[(2*i+1)*nv4 + kk + k];
            float4 b0 = B[2*j*nv4 + kk + k];
            float4 b1 = B[(2*j+1)*nv4 + kk + k];
            ab += (float4)(dot(a0, b0), dot(a0, b1),
                           dot(a1, b0), dot(a1, b1));
        }
        /* Reassemble the work-group once per block of dk iterations. */
        barrier(CLK_GLOBAL_MEM_FENCE);
    }

    uint ix = 2*i*(n >> 1) + j;     /* row 2i of C, in float2 elements */
    C[ix] = alpha * ab.s01 + beta * C[ix];
    C[ix + (n >> 1)] = alpha * ab.s23 + beta * C[ix + (n >> 1)];
}

Listing 7.13. Cache-blocked implementation: cacheblockedNT. The constants di, dj, and dk correspond to our $\Delta I$, $\Delta J$, and $\Delta K_{\text{cache}}$, respectively.

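
The index arithmetic of the listing can be checked on the host with a plain C sketch. This is an illustration, not part of the original listing: the names `sgemm_item`, `sgemm_blocked_host`, and `DK` are hypothetical, the matrices are assumed to be row-major n x n float arrays, and n is assumed to be a multiple of 4 * DK so that the blocked loop covers each row exactly. Under those assumptions, work-item (i, j) reads rows 2i and 2i+1 of A and rows 2j and 2j+1 of B, and updates the corresponding 2 x 2 block of C with alpha * A * B^T + beta * C. The sketch accumulates element by element rather than with float4 dot products, so rounding may differ slightly from the GPU result.

```c
#include <assert.h>
#include <stddef.h>

enum { DK = 32 };  /* float4 steps per cache block, mirroring dk (assumed name) */

/* Host-side emulation of one work-item (i, j); hypothetical helper. */
static void sgemm_item(const float *A, const float *B, float *C,
                       float alpha, float beta, unsigned n,
                       unsigned i, unsigned j)
{
    const float *a0 = A + (size_t)(2 * i) * n;  /* row 2i of A   */
    const float *a1 = a0 + n;                   /* row 2i+1 of A */
    const float *b0 = B + (size_t)(2 * j) * n;  /* row 2j of B   */
    const float *b1 = b0 + n;                   /* row 2j+1 of B */
    float ab[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    /* One outer iteration covers DK float4 elements, i.e. 4 * DK floats. */
    for (unsigned k = 0; k < n; k += 4 * DK) {
        for (unsigned kk = k; kk < k + 4 * DK; ++kk) {
            ab[0] += a0[kk] * b0[kk];
            ab[1] += a0[kk] * b1[kk];
            ab[2] += a1[kk] * b0[kk];
            ab[3] += a1[kk] * b1[kk];
        }
        /* On the GPU, the work-group barrier is executed at this point. */
    }

    float *c0 = C + (size_t)(2 * i) * n + 2 * j;  /* C[2i][2j]   */
    float *c1 = c0 + n;                           /* C[2i+1][2j] */
    c0[0] = alpha * ab[0] + beta * c0[0];
    c0[1] = alpha * ab[1] + beta * c0[1];
    c1[0] = alpha * ab[2] + beta * c1[0];
    c1[1] = alpha * ab[3] + beta * c1[1];
}

/* Runs every work-item of the (n/2) x (n/2) global range in sequence. */
void sgemm_blocked_host(const float *A, const float *B, float *C,
                        float alpha, float beta, unsigned n)
{
    assert(n % (4 * DK) == 0);  /* assumption: n is a multiple of 128 */
    for (unsigned i = 0; i < n / 2; ++i)
        for (unsigned j = 0; j < n / 2; ++j)
            sgemm_item(A, B, C, alpha, beta, n, i, j);
}
```

Calling `sgemm_blocked_host(A, B, C, alpha, beta, 128)` on three 128 x 128 arrays gives a result that can be compared against a straightforward triple loop over the same data.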
The benefit of the barrier is that we get the same L1 cache sharing for large matrices as we had for small matrices. The cost of executing the barrier comes from having to reassemble all work-items of the work-group at regular intervals. If no thread divergence occurs, all work-items need to enter the barrier, which takes time proportional to the number of work-items, and then all work-items need to exit from the barrier, which again takes a number of cycles proportional to the number of work-items involved. This means that the effective cost of a barrier is the number of cycles it takes, times the number of work-items that are taking part in the barrier, or at least $2(\lambda_0\lambda_1)^2$.[22] It is therefore beneficial for the actual execution time of the barrier to

[22] Executing the barrier instruction takes $2\lambda_0\lambda_1$ cycles, and in this time the $\lambda_0\lambda_1$ work-items involved are prevented from performing other work, which means that the work lost due to the barrier instruction is $2\lambda_0\lambda_1 \cdot \lambda_0\lambda_1 = 2(\lambda_0\lambda_1)^2$. Again, the description is simplified but sufficient for our needs (e.g., while the last work-item exits from the barrier, the first work-item is already performing work again).
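
The simplified cost model in the footnote can be written out as a few lines of C. This is only a sketch of that arithmetic: the function names are hypothetical, and λ0 and λ1 are assumed to be the two work-group dimensions, so that their product is the number of work-items in the group. The barrier count per work-item follows from the outer loop of the listing, which runs once per block of dk float4 elements.

```c
#include <stdint.h>

/* Cycles one barrier takes in the simplified model:
 * every work-item enters, then every work-item exits. */
uint64_t barrier_cycles(uint64_t l0, uint64_t l1)
{
    return 2 * l0 * l1;
}

/* Work lost per barrier: barrier cycles times the work-items stalled
 * during those cycles, i.e. 2 * (l0 * l1)^2. */
uint64_t barrier_lost_work(uint64_t l0, uint64_t l1)
{
    return barrier_cycles(l0, l1) * (l0 * l1);
}

/* Barriers each work-item executes: one per outer-loop iteration,
 * where the loop steps through n/4 float4 elements in blocks of dk. */
uint64_t barriers_per_work_item(uint64_t n, uint64_t dk)
{
    uint64_t nv4 = n / 4;
    return (nv4 + dk - 1) / dk;
}
```

For a hypothetical 4 x 4 work-group, this gives 32 cycles per barrier and 512 work-item cycles of lost work. With n = 1024 and dk = 32, each work-item passes through 8 barriers.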