kernel void
sgemm(global float const *A, global float const *B,
      global float *C, float alpha, float beta, uint n)
{
    uint j = get_global_id(0);  /* column index of C */
    uint i = get_global_id(1);  /* row index of C */
    float ABij = 0.0f;
    /* Dot product of row i of A and row j of B (column j of B^T). */
    for (uint k = 0; k < n; ++k)
    {
        ABij += A[i*n + k] * B[j*n + k];
    }
    C[i*n + j] = alpha * ABij + beta * C[i*n + j];
}
Listing 7.10. Initial scalar implementation: scalarNT.
Transposed. Each work-item of the scalarNT version in Listing 7.10 produces one
element of C by computing the dot product of a row of A and a column of Bᵀ
(or equivalently a row of B).
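The per-work-item computation described above can be checked on the host with a plain C loop nest. The following is a minimal sketch, not part of the original listing: the function name `sgemm_nt_ref` is hypothetical, and it assumes square n × n row-major matrices with B already stored transposed, as in the scalarNT kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference for the scalarNT kernel.
 * Each (i, j) iteration plays the role of one work-item and produces
 * one element of C. B is assumed to be stored transposed, so row j of
 * the array B is column j of the B^T operand. */
void sgemm_nt_ref(const float *A, const float *B, float *C,
                  float alpha, float beta, size_t n)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float ABij = 0.0f;
            /* Both A and B are walked along a row, i.e. with stride 1. */
            for (size_t k = 0; k < n; ++k)
                ABij += A[i * n + k] * B[j * n + k];
            C[i * n + j] = alpha * ABij + beta * C[i * n + j];
        }
    }
}
```

Comparing the output of such a reference against the buffer read back from the device is one simple way to validate a kernel variant before tuning it.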
7.5.4 Memory Access Patterns of Scalar Implementations
A single work-item of the scalarNN version sequentially reads (within the k loop)
from pairs of locations (A[i,0], B[0,j]), (A[i,1], B[1,j]), ..., (A[i,N−1], B[N−1,j]).
We will abbreviate this access pattern to
\[ \mathop{;}_{k=0}^{N-1} \bigl( A[i,k],\, B[k,j] \bigr), \]
which denotes that the accesses happen sequentially for 0 ≤ k < N.
Similarly, the access pattern of a single work-item of the scalarNT version is
\[ \mathop{;}_{k=0}^{N-1} \bigl( A[i,k],\, B[j,k] \bigr). \]
With the row-major array layout used in the C language, the scalarNT variant
reads both A and B with stride 1, while the scalarNN variant reads B with
stride N .
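The stride claim follows directly from row-major index arithmetic. The sketch below is an illustration only: the helper names `offset_row_major`, `stride_b_nn`, and `stride_b_nt` are hypothetical, and it assumes square n × n arrays. It computes the distance in elements between the B locations that one work-item reads at steps k and k + 1.

```c
#include <assert.h>
#include <stddef.h>

/* Linear offset of element [row, col] in a row-major n x n array. */
static size_t offset_row_major(size_t row, size_t col, size_t n)
{
    assert(row < n && col < n);
    return row * n + col;
}

/* scalarNN reads B[k, j]: consecutive k values are one full row apart. */
size_t stride_b_nn(size_t j, size_t k, size_t n)
{
    return offset_row_major(k + 1, j, n) - offset_row_major(k, j, n);
}

/* scalarNT reads B[j, k]: consecutive k values are adjacent in memory. */
size_t stride_b_nt(size_t j, size_t k, size_t n)
{
    return offset_row_major(j, k + 1, n) - offset_row_major(j, k, n);
}
```

For any valid j and k (with k + 1 < n), the first helper evaluates to n and the second to 1, matching the strides stated above.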
Let us assume a core executes a single work-group of dimensions (λ₀, λ₁).
Since work-items execute in an interleaved order (Section 7.3.3), the actual
memory access pattern of the scalarNN variant on the core will be
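\[ \mathop{;}_{k=0}^{N-1} \left[ \mathop{;}_{i=0}^{\lambda_1-1} \mathop{;}_{j=0}^{\lambda_0-1} A[i,k],\;\; \mathop{;}_{i=0}^{\lambda_1-1} \mathop{;}_{j=0}^{\lambda_0-1} B[k,j] \right], \]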