kernel void
sgemm(global float4 * const A, global float4 * const B,
      global float2 *C, float alpha, float beta, uint n)
{
    uint i = get_global_id(0);        /* rows 2i and 2i+1 of C */
    uint j = get_global_id(1);        /* columns 2j and 2j+1 of C */
    uint nv4 = n >> 2;                /* float4 elements per row of A and B */
    float4 ab = (float4)(0.0f);       /* 2x2 accumulator block */
    for (uint k = 0; k < nv4; ++k)
    {
        float4 a0 = A[2 * i * nv4 + k];
        float4 a1 = A[(2 * i + 1) * nv4 + k];
        float4 b0 = B[2 * j * nv4 + k];
        float4 b1 = B[(2 * j + 1) * nv4 + k];
        ab += (float4)(dot(a0, b0), dot(a0, b1),
                       dot(a1, b0), dot(a1, b1));
    }
    uint ix = 2 * i * (n >> 1) + j;   /* float2 index of the upper row */
    C[ix] = alpha * ab.s01 + beta * C[ix];
    C[ix + (n >> 1)] = alpha * ab.s23 + beta * C[ix + (n >> 1)];
}
Listing 7.12. Vectorized implementation: blockedNT.
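To check the index arithmetic of Listing 7.12 on the host, the kernel can be emulated in plain Python. This is a minimal sketch and not part of the original text: the function names `sgemm_blocked_nt` and `sgemm_reference_nt` are hypothetical, the matrices are assumed to be flat row-major lists of n*n floats with n a multiple of 4, and the two outer loops stand in for an assumed global work size of (n/2) x (n/2).

```python
def sgemm_blocked_nt(A, B, C, alpha, beta, n):
    """Emulate Listing 7.12 on flat row-major lists of n*n floats.

    B is read row by row, so the result is alpha * A * B^T + beta * C.
    """
    assert n % 4 == 0, "the kernel assumes n is a multiple of 4"
    nv4 = n >> 2                      # float4 elements per row of A and B
    half = n >> 1                     # float2 elements per row of C

    def load4(M, idx):                # M[idx] when M is viewed as float4*
        return M[4 * idx: 4 * idx + 4]

    def dot(u, v):
        return sum(x * y for x, y in zip(u, v))

    out = list(C)
    for i in range(half):             # plays the role of get_global_id(0)
        for j in range(half):         # plays the role of get_global_id(1)
            ab = [0.0, 0.0, 0.0, 0.0]
            for k in range(nv4):
                a0 = load4(A, 2 * i * nv4 + k)
                a1 = load4(A, (2 * i + 1) * nv4 + k)
                b0 = load4(B, 2 * j * nv4 + k)
                b1 = load4(B, (2 * j + 1) * nv4 + k)
                for t, d in enumerate((dot(a0, b0), dot(a0, b1),
                                       dot(a1, b0), dot(a1, b1))):
                    ab[t] += d
            ix = 2 * i * half + j     # float2 index into C
            for r in range(2):        # r = 0 is ab.s01, r = 1 is ab.s23
                for c in range(2):
                    p = 2 * (ix + r * half) + c   # scalar index into C
                    out[p] = alpha * ab[2 * r + c] + beta * C[p]
    return out


def sgemm_reference_nt(A, B, C, alpha, beta, n):
    """Naive C[i][j] = alpha * sum_k A[i][k] * B[j][k] + beta * C[i][j]."""
    return [alpha * sum(A[i * n + k] * B[j * n + k] for k in range(n))
            + beta * C[i * n + j]
            for i in range(n) for j in range(n)]
```

Comparing the two functions on small integer-valued inputs (so that floating-point summation order does not matter) is a quick way to confirm that each work-item writes the 2 x 2 block of C it is supposed to.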
We saw the need to introduce blocking to enable the use of vector operations,
but register blocking also decreases the number of loads necessary. Our scalar
implementations (both the NN and NT variants) loaded $N$ elements of $A$ and $N$
elements of $B$ to compute one element of $C$, so we needed to load
$(N + N)N^2 = 2N^3$ elements from $A$ and $B$. In general, we need to load one
$\Delta I \times \Delta K_{\text{reg.}}$ block from $A$ and one
$\Delta K_{\text{reg.}} \times \Delta J$ block from $B$ per iteration, and we
need $N/\Delta K_{\text{reg.}}$ iterations. We need one work-item for each of
the $(N/\Delta I)(N/\Delta J)$ blocks in $C$, which gives us a total of

$$\frac{N}{\Delta I}\,\frac{N}{\Delta J}\,\frac{N}{\Delta K_{\text{reg.}}}\left(\Delta I\,\Delta K_{\text{reg.}} + \Delta K_{\text{reg.}}\,\Delta J\right) = N^3\left(\frac{1}{\Delta I} + \frac{1}{\Delta J}\right)$$

elements to be loaded into registers.
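The load count derived above can be cross-checked by brute force. The sketch below is illustrative only: the function names are hypothetical, and it assumes that N is divisible by all three block sizes. It walks every work-item and every iteration, adds up the block sizes, and compares the sum with the closed form.

```python
def loads_counted(n, di, dj, dk):
    """Count element loads by walking every work-item and every iteration."""
    assert n % di == 0 and n % dj == 0 and n % dk == 0
    total = 0
    for _bi in range(n // di):          # one work-item per di x dj block of C
        for _bj in range(n // dj):
            for _k in range(n // dk):   # iterations over the shared dimension
                total += di * dk        # di x dk block of A
                total += dk * dj        # dk x dj block of B
    return total


def loads_closed_form(n, di, dj):
    """N^3 * (1/di + 1/dj), kept in integers as N^3 * (di + dj) / (di * dj)."""
    return n ** 3 * (di + dj) // (di * dj)
```

For example, `loads_counted(16, 1, 1, 1)` reproduces the scalar figure of 2N^3, and changing only `dk` leaves the count unchanged, which matches the fact that the K block size cancels out of the closed form.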
The above result tells us that we should choose $\Delta I$ and $\Delta J$ large
and similar, while the choice of $\Delta K_{\text{reg.}}$ is less important,
since it cancels out of the load count. We always set $\Delta K_{\text{reg.}}$
to 4, as this is the smallest value that allows us to use vector operations.¹⁵
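One way to see why the two block sizes should be similar is to fix their product and compare the resulting load factors. The sketch below does this under an assumption that is not in the original text: the product of the I and J block sizes is treated as a fixed accumulator budget (16 in the usage note). The names `load_factor` and `blockings_with_budget` are hypothetical.

```python
from fractions import Fraction


def load_factor(di, dj):
    """Loads per N^3, i.e. 1/di + 1/dj, as an exact fraction."""
    return Fraction(1, di) + Fraction(1, dj)


def blockings_with_budget(budget):
    """All (di, dj) with di * dj == budget, cheapest load factor first."""
    pairs = [(di, budget // di) for di in range(1, budget + 1)
             if budget % di == 0]
    return sorted(pairs, key=lambda p: (load_factor(*p), p))
```

With a budget of 16, the balanced 4 x 4 choice comes first with a load factor of 1/2, while the most lopsided choices, 1 x 16 and 16 x 1, come last with 17/16.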
In NN implementations, we also have to choose $\Delta J$ as a multiple of 4 to
allow for the use of vector operations, whereas $\Delta I = \Delta J = 2$ is one
option we may choose in the NT case. We can compute the difference in load
requirements between the $1 \times 4 \times 4$ and $2 \times 4 \times 2$
implementations¹⁶ by computing $1/\Delta I + 1/\Delta J$
¹⁵ We note that the scalar version corresponds to $\Delta I = \Delta J = \Delta K_{\text{reg.}} = 1$.
¹⁶ We will sometimes refer to a blocking by writing it $\Delta I \times \Delta K_{\text{reg.}} \times \Delta J$.
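As a worked instance, the expression $1/\Delta I + 1/\Delta J$ can be evaluated for the two blockings named in the text. This is plain arithmetic on the load formula, not a measured result, and it counts loads into registers only:

```latex
\begin{align*}
1 \times 4 \times 4:&\quad \frac{1}{\Delta I} + \frac{1}{\Delta J} = \frac{1}{1} + \frac{1}{4} = \frac{5}{4},\\
2 \times 4 \times 2:&\quad \frac{1}{\Delta I} + \frac{1}{\Delta J} = \frac{1}{2} + \frac{1}{2} = 1.
\end{align*}
```

Under this count, the $2 \times 4 \times 2$ blocking loads $N^3$ elements where the $1 \times 4 \times 4$ blocking loads $\tfrac{5}{4}N^3$, that is, four fifths as many.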