Optimizing OpenCL Kernels for the ARM Mali-T600 GPUs - GPU Pro: Advanced Rendering Techniques - page 343


    float4 a  = A[i, k];
    float4 b0 = B[k+0, j];
    float4 b1 = B[k+1, j];
    float4 b2 = B[k+2, j];
    float4 b3 = B[k+3, j];
    ab += a.s0*b0 + a.s1*b1 + a.s2*b2 + a.s3*b3;

where ab (of type float4) is the accumulator for the 1 × 4 block of C and all operations are vector operations. The kernel is shown in Listing 7.11.

For the NT variant, we instead select (ΔI = 2, ΔJ = 2, ΔK_reg. = 4) and implement the multiplication between the 2 × 4 block of A and the 2 × 4 block of the transposed B as

    float4 a0 = A[i, k];
    float4 a1 = A[i+1, k];
    float4 b0 = B[j, k];
    float4 b1 = B[j+1, k];
    ab.s01 += (float2)(dot(a0, b0), dot(a0, b1));
    ab.s23 += (float2)(dot(a1, b0), dot(a1, b1));

where ab is an accumulator variable of type float4 for the 2 × 2 block of the matrix C.[14] The full kernel is shown in Listing 7.12.
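The NT block update can be checked in the same way on the host. The sketch below is an illustration under assumptions, not the chapter's kernel: float4_t, dot4, and nt_block_step are hypothetical names, and dot4 stands in for the OpenCL built-in dot on float4 operands.

```c
/* Host-side stand-in for OpenCL's float4 (assumption: emulation only). */
typedef struct { float s[4]; } float4_t;

/* Emulates the OpenCL built-in dot(x, y) for float4 arguments. */
static float dot4(float4_t x, float4_t y)
{
    float acc = 0.0f;
    for (int c = 0; c < 4; ++c) acc += x.s[c] * y.s[c];
    return acc;
}

/* One register-block step of the NT variant.
   a0, a1: 1x4 slices of rows i and i+1 of A.
   b0, b1: 1x4 slices of rows j and j+1 of the transposed B.
   ab holds the 2x2 block of C in row-major order:
   components 0,1 (ab.s01) are the top row, components 2,3 (ab.s23)
   are the bottom row. */
static float4_t nt_block_step(float4_t ab,
                              float4_t a0, float4_t a1,
                              float4_t b0, float4_t b1)
{
    ab.s[0] += dot4(a0, b0);  /* C[i,   j]   */
    ab.s[1] += dot4(a0, b1);  /* C[i,   j+1] */
    ab.s[2] += dot4(a1, b0);  /* C[i+1, j]   */
    ab.s[3] += dot4(a1, b1);  /* C[i+1, j+1] */
    return ab;
}
```

For a0 = (1, 2, 3, 4), a1 = (5, 6, 7, 8), b0 = (1, 1, 1, 1), and b1 = (1, 2, 3, 4), a zero accumulator becomes (10, 30, 26, 70). Because B is stored transposed, every operand is a contiguous float4 load and each element of the block reduces to a single dot product.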

kernel void
sgemm(global float4 const * A, global float4 const * B,
      global float4 * C, float alpha, float beta, uint n)
{
    uint j = get_global_id(0);      // column of C, in float4 units
    uint i = get_global_id(1);      // row of C
    uint nv4 = n >> 2;              // row length in float4 units (n / 4)
    float4 accum = (float4)0.0f;
    for (uint k = 0; k < nv4; ++k)
    {
        float4 a  = A[i*nv4 + k];           // 1 x 4 slice of row i of A
        float4 b0 = B[(4*k+0)*nv4 + j];     // four consecutive rows of B
        float4 b1 = B[(4*k+1)*nv4 + j];
        float4 b2 = B[(4*k+2)*nv4 + j];
        float4 b3 = B[(4*k+3)*nv4 + j];
        accum += a.s0*b0 + a.s1*b1 + a.s2*b2 + a.s3*b3;
    }
    C[i*nv4 + j] = alpha*accum + beta*C[i*nv4 + j];
}

Listing 7.11. Vectorized implementation: blockedNN .
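The index arithmetic of the listing can be sanity-checked on the host. The following C sketch emulates each work-item with scalar loops and compares the result against a naive triple-loop SGEMM. It assumes row-major n × n matrices with n a multiple of 4, and a two-dimensional global size of (n/4, n), which is inferred from the indexing in the listing rather than stated in the text. All function names are hypothetical.

```c
/* Emulates one work-item (i, j) of the blocked NN kernel.
   A, B, C are n x n row-major float arrays; the float4 at index p of the
   kernel corresponds to floats 4*p .. 4*p+3 here. */
static void sgemm_nn_work_item(const float *A, const float *B, float *C,
                               float alpha, float beta,
                               unsigned n, unsigned i, unsigned j)
{
    unsigned nv4 = n >> 2;                       /* row length in float4 units */
    float accum[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (unsigned k = 0; k < nv4; ++k) {
        const float *a = &A[4 * (i * nv4 + k)];  /* A[i*nv4 + k] */
        for (unsigned r = 0; r < 4; ++r) {
            /* B[(4*k + r)*nv4 + j], i.e. b0..b3 in the listing */
            const float *b = &B[4 * ((4 * k + r) * nv4 + j)];
            /* Accumulated term by term; the kernel sums the four products
               first, so rounding may differ slightly for general inputs. */
            for (unsigned c = 0; c < 4; ++c) accum[c] += a[r] * b[c];
        }
    }
    float *c_out = &C[4 * (i * nv4 + j)];        /* C[i*nv4 + j] */
    for (unsigned c = 0; c < 4; ++c)
        c_out[c] = alpha * accum[c] + beta * c_out[c];
}

/* Emulates the whole NDRange: dimension 0 covers n/4 float4 columns,
   dimension 1 covers n rows (assumed global size). */
static void sgemm_nn_emulated(const float *A, const float *B, float *C,
                              float alpha, float beta, unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        for (unsigned j = 0; j < (n >> 2); ++j)
            sgemm_nn_work_item(A, B, C, alpha, beta, n, i, j);
}

/* Naive reference: C = alpha * A * B + beta * C. */
static void sgemm_reference(const float *A, const float *B, float *C,
                            float alpha, float beta, unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        for (unsigned j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (unsigned k = 0; k < n; ++k) acc += A[i * n + k] * B[k * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```

Running both routines on the same small integer-valued inputs should give identical results, since every intermediate value is then exactly representable in single precision. Each emulated work-item writes one float4 of C, so n²/4 work-items cover the whole matrix.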

[14] The components ab.s01 accumulate the top row and the components ab.s23 accumulate the bottom row of the 2 × 2 block.
