Graphics Reference
In-Depth Information
    float4 a  = A[i, k];
    float4 b0 = B[k+0, j];
    float4 b1 = B[k+1, j];
    float4 b2 = B[k+2, j];
    float4 b3 = B[k+3, j];
    ab += a.s0 * b0 + a.s1 * b1 + a.s2 * b2 + a.s3 * b3;
where ab (of type float4) is the accumulator for the 1 × 4 block of C and all
operations are vector operations. The kernel is shown in Listing 7.11.
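The vector expression above can be checked against a component-wise version. The following is a minimal host-side sketch in plain C, not code from the text: vec4 and nn_block_update are hypothetical names, with vec4 standing in for OpenCL's float4.

```c
/* Hypothetical stand-in for OpenCL's float4 (illustration only). */
typedef struct { float s[4]; } vec4;

/*
 * Component-wise equivalent of
 *     ab += a.s0 * b0 + a.s1 * b1 + a.s2 * b2 + a.s3 * b3;
 * a holds four consecutive elements of one row of A (positions k..k+3),
 * b[r] holds four consecutive elements of row k+r of B,
 * ab accumulates the 1 x 4 block of C.
 */
void nn_block_update(vec4 *ab, vec4 a, const vec4 b[4])
{
    for (int c = 0; c < 4; ++c)       /* column within the 1 x 4 block of C */
        for (int r = 0; r < 4; ++r)   /* offset along the k dimension */
            ab->s[c] += a.s[r] * b[r].s[c];
}
```

For example, with a = (1, 2, 3, 4) and every component of b[r] equal to r + 1, each component of ab grows by 1·1 + 2·2 + 3·3 + 4·4 = 30.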
For the NT variant, we instead select (ΔI = 2, ΔJ = 2, ΔK_reg = 4) and
implement the multiplication between the 2 × 4 block of A and the 2 × 4 block
of the transposed B as
    float4 a0 = A[i, k];
    float4 a1 = A[i+1, k];
    float4 b0 = B[j, k];
    float4 b1 = B[j+1, k];
    ab.s01 += (float2)(dot(a0, b0), dot(a0, b1));
    ab.s23 += (float2)(dot(a1, b0), dot(a1, b1));
where ab is an accumulator variable of type float4 for the 2 × 2 block of the
matrix C (see footnote 14). The full kernel is shown in Listing 7.12.
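The packing of the 2 × 2 block into one float4 can be written out component by component. This is a minimal host-side sketch in plain C under assumed names (vec4, dot4 and nt_block_update are hypothetical; dot4 mimics OpenCL's built-in dot for four-component vectors), not the kernel from the text.

```c
/* Hypothetical stand-in for OpenCL's float4 (illustration only). */
typedef struct { float s[4]; } vec4;

/* Four-component dot product, mimicking OpenCL's dot(float4, float4). */
float dot4(vec4 x, vec4 y)
{
    float sum = 0.0f;
    for (int t = 0; t < 4; ++t)
        sum += x.s[t] * y.s[t];
    return sum;
}

/*
 * a0, a1: four consecutive k-elements of rows i and i+1 of A.
 * b0, b1: four consecutive k-elements of rows j and j+1 of the transposed B.
 * ab:     the 2 x 2 block of C, stored row by row:
 *         s0, s1 = top row (ab.s01); s2, s3 = bottom row (ab.s23).
 */
void nt_block_update(vec4 *ab, vec4 a0, vec4 a1, vec4 b0, vec4 b1)
{
    ab->s[0] += dot4(a0, b0);   /* C[i,   j]   */
    ab->s[1] += dot4(a0, b1);   /* C[i,   j+1] */
    ab->s[2] += dot4(a1, b0);   /* C[i+1, j]   */
    ab->s[3] += dot4(a1, b1);   /* C[i+1, j+1] */
}
```

Because B is stored transposed, both operands of each dot product are contiguous along k, so each of the four results needs only one vector load per operand.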
kernel void
sgemm(global float4 const *A, global float4 const *B,
      global float4 *C, float alpha, float beta, uint n)
{
    uint j = get_global_id(0);   /* column of C, in float4 units */
    uint i = get_global_id(1);   /* row of C */
    uint nv4 = n >> 2;           /* row length in float4 units */
    float4 accum = (float4)0.0f;
    for (uint k = 0; k < nv4; ++k)
    {
        float4 a  = A[i * nv4 + k];
        float4 b0 = B[(4 * k + 0) * nv4 + j];
        float4 b1 = B[(4 * k + 1) * nv4 + j];
        float4 b2 = B[(4 * k + 2) * nv4 + j];
        float4 b3 = B[(4 * k + 3) * nv4 + j];
        accum += a.s0 * b0 + a.s1 * b1 + a.s2 * b2 + a.s3 * b3;
    }
    C[i * nv4 + j] = alpha * accum + beta * C[i * nv4 + j];
}
Listing 7.11. Vectorized implementation: blockedNN.
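To check the index arithmetic of the kernel on the host, one can emulate every work-item with plain loops. The sketch below is an assumption-laden reference in C, not part of the original listing: sgemm_nn_reference is a hypothetical name, the matrices are assumed to be square, row-major float arrays with n a multiple of 4, and the global work size is assumed to be n/4 by n (one work-item per 1 × 4 block of C).

```c
/*
 * Host-side emulation of the blocked NN kernel, computing
 * C = alpha * A * B + beta * C.
 * Float index 4 * v + c corresponds to component c of float4 element v.
 */
void sgemm_nn_reference(const float *A, const float *B, float *C,
                        float alpha, float beta, unsigned n)
{
    unsigned nv4 = n >> 2;                    /* row length in float4 units */

    for (unsigned i = 0; i < n; ++i)          /* plays get_global_id(1) */
        for (unsigned j = 0; j < nv4; ++j) {  /* plays get_global_id(0) */
            float accum[4] = {0.0f, 0.0f, 0.0f, 0.0f};

            for (unsigned k = 0; k < nv4; ++k)
                for (unsigned r = 0; r < 4; ++r) {
                    /* a.s<r>: component r of A[i * nv4 + k] */
                    float a = A[4 * (i * nv4 + k) + r];
                    /* b<r>: the float4 B[(4 * k + r) * nv4 + j] */
                    const float *b = &B[4 * ((4 * k + r) * nv4 + j)];
                    for (unsigned c = 0; c < 4; ++c)
                        accum[c] += a * b[c];
                }

            /* Scale and write back the 1 x 4 block of C. */
            float *out = &C[4 * (i * nv4 + j)];
            for (unsigned c = 0; c < 4; ++c)
                out[c] = alpha * accum[c] + beta * out[c];
        }
}
```

Multiplying by the identity matrix is a convenient sanity check: with B = I, every element of the result should equal alpha times the matching element of A plus beta times the old element of C.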
14 The components ab.s01 accumulate the top row and the components ab.s23 accumulate
the bottom row of the 2 × 2 block.