#define di ((uint)2)
#define dj ((uint)2)
#define dk ((uint)32)

kernel void
sgemm(global float4 const *A, global float4 const *B,
      global float2 *C, float alpha, float beta, uint n)
{
    uint j = get_global_id(0);
    uint i = get_global_id(1);
    uint nv4 = n >> 2;          /* row length in float4 elements */
    float4 ab = (float4)0.0f;   /* accumulators for a 2x2 block of C */
    for (uint k = 0; k < nv4; k += dk)
    {
        for (uint kk = 0; kk < dk; ++kk)
        {
            float4 a0 = A[ 2*i   *nv4 + kk + k];
            float4 a1 = A[(2*i+1)*nv4 + kk + k];
            float4 b0 = B[ 2*j   *nv4 + kk + k];
            float4 b1 = B[(2*j+1)*nv4 + kk + k];
            ab += (float4)(dot(a0, b0), dot(a0, b1),
                           dot(a1, b0), dot(a1, b1));
        }
        /* regroup the work-group after each block of dk iterations */
        barrier(CLK_GLOBAL_MEM_FENCE);
    }
    uint ix = 2*i*(n>>1) + j;   /* index into C in float2 units */
    C[ix] = alpha*ab.s01 + beta*C[ix];
    C[ix + (n>>1)] = alpha*ab.s23 + beta*C[ix + (n>>1)];
}
Listing 7.13. Cache-blocked implementation: cacheblockedNT. The constants di, dj,
and dk correspond to our ΔI, ΔJ, and ΔK_cache, respectively.
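To make the index arithmetic of the listing easier to check, the following is a minimal CPU sketch in plain C, not part of the original text. It assumes row-major n x n matrices and that the kernel computes C = alpha * A * B^T + beta * C, which is how the NT suffix and the dot products read. The function names sgemm_nt_reference and sgemm_nt_work_item and the parameter block_k (standing in for dk) are hypothetical and introduced only for illustration. Barriers are omitted because the emulation runs one work-item at a time.

```c
#include <stddef.h>

/* Naive reference (assumed semantics): C = alpha * A * B^T + beta * C,
   all matrices n x n and row-major. */
void sgemm_nt_reference(const float *A, const float *B, float *C,
                        float alpha, float beta, unsigned n)
{
    for (unsigned r = 0; r < n; ++r) {
        for (unsigned c = 0; c < n; ++c) {
            float acc = 0.0f;
            for (unsigned k = 0; k < n; ++k)
                acc += A[(size_t)r * n + k] * B[(size_t)c * n + k];
            C[(size_t)r * n + c] = alpha * acc + beta * C[(size_t)r * n + c];
        }
    }
}

/* Scalar emulation of one work-item (i, j) of the listing.
   block_k plays the role of dk; n must be a multiple of 4 * block_k. */
void sgemm_nt_work_item(const float *A, const float *B, float *C,
                        float alpha, float beta, unsigned n,
                        unsigned block_k, unsigned i, unsigned j)
{
    unsigned nv4 = n >> 2;                 /* row length in groups of 4 floats */
    float ab[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    for (unsigned k = 0; k < nv4; k += block_k) {
        for (unsigned kk = 0; kk < block_k; ++kk) {
            /* Each pointer addresses one group of 4 floats, like a float4 load. */
            const float *a0 = A + 4 * ((size_t)(2 * i) * nv4 + kk + k);
            const float *a1 = A + 4 * ((size_t)(2 * i + 1) * nv4 + kk + k);
            const float *b0 = B + 4 * ((size_t)(2 * j) * nv4 + kk + k);
            const float *b1 = B + 4 * ((size_t)(2 * j + 1) * nv4 + kk + k);
            for (unsigned l = 0; l < 4; ++l) {
                ab[0] += a0[l] * b0[l];
                ab[1] += a0[l] * b1[l];
                ab[2] += a1[l] * b0[l];
                ab[3] += a1[l] * b1[l];
            }
        }
    }

    /* The float2 index ix = 2*i*(n/2) + j corresponds to row 2i, column 2j. */
    float *row0 = C + (size_t)(2 * i) * n + 2 * j;
    float *row1 = row0 + n;
    row0[0] = alpha * ab[0] + beta * row0[0];
    row0[1] = alpha * ab[1] + beta * row0[1];
    row1[0] = alpha * ab[2] + beta * row1[0];
    row1[1] = alpha * ab[3] + beta * row1[1];
}
```

Calling sgemm_nt_work_item for every (i, j) with 0 <= i, j < n/2 should reproduce the result of sgemm_nt_reference, which confirms that each work-item owns one 2 x 2 block of C.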
The benefit of the barrier is that we get the same L1 cache sharing for
large matrices as we had for small matrices. The cost of the barrier comes
from having to reassemble all work-items of the work-group at regular
intervals. If no thread divergence occurs, all work-items must first enter
the barrier, which takes time proportional to the number of work-items, and
then all work-items must exit from the barrier, which again takes a number
of cycles proportional to the number of work-items involved.
This means that the effective cost of a barrier is the number of cycles it takes,
times the number of work-items that are taking part in the barrier, or at least
2(λ₀λ₁)².[22] It is therefore beneficial for the actual execution time of the barrier to

[22] Executing the barrier instruction takes 2λ₀λ₁ cycles, and in this time the λ₀λ₁
work-items involved are prevented from performing other work, which means that the
work lost due to the barrier instruction is 2(λ₀λ₁)². Again, the description is
simplified but sufficient for our needs (e.g., while the last work-item exits from
the barrier, the first work-item is already performing work again).
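The barrier cost model above can be written out as a small calculation. This is only a sketch of the simplified model in the text, assuming λ₀ and λ₁ are the two work-group dimensions so that λ₀λ₁ is the number of work-items taking part in the barrier. The function names barrier_cycles and barrier_lost_work are hypothetical.

```c
/* Cycles spent in one barrier under the simplified model:
   every work-item enters, then every work-item exits. */
unsigned long barrier_cycles(unsigned long lambda0, unsigned long lambda1)
{
    return 2UL * lambda0 * lambda1;
}

/* Work lost to one barrier, in work-item-cycles: the barrier's duration
   multiplied by the number of work-items that are stalled meanwhile. */
unsigned long barrier_lost_work(unsigned long lambda0, unsigned long lambda1)
{
    return barrier_cycles(lambda0, lambda1) * lambda0 * lambda1;
}
```

For an 8 x 8 work-group, for instance, the model gives 2 * 64 = 128 cycles per barrier and 128 * 64 = 8192 work-item-cycles of lost work, so the cost grows with the square of the work-group size.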