Information Technology Reference
In-Depth Information
Algorithm: [ C ]:=sbmm BLK outer ( C,A,B,k )
Partition C ₒ C L C R ,Bₒ B L B R
where C L ,B L have 0 columns
while n ( C L ) <n ( C ) do
Determine block size c
Repartition
C L C R C 0 C 1 C 2 , B L B R B 0 B 1 B 2
where C 1 ,B 1 have c columns
C 1 := sbmm BLK inner ( C 1 ,A,B 1 ,k )
Continue with
C L C R C 0 C 1 C 2 , B L B R B 0 B 1 B 2
endwhile
Fig. 2. Outer loop of Algorithm sbmm BLK that computes C := A · B + C
storage format. An analogous procedure that accesses the elements in the upper
triangle or all the elements of A is straight-forward.
The algorithm consists of two loops. The outer loop (Figure 2) partitions
matrices B and C into blocks of c columns; at every iteration of the loop, the
elements in the active column-block of C are updated. The inner loop (Figure 3)
proceeds along the main diagonal of A (top-left to down-right), updating the
corresponding elements of C . Matrices B and C are partitioned row-wise, while
matrix A requires a 3
3 partition. At each iteration, blocks A i 1 and B i with
i =1 , 2 , 3, are accessed; while block C i where i =1 , 2 , 3, is updated. Note that
A 11 and A 31 are respectively, lower and upper triangular. Figure 4 details the
blocks accessed and updated at a given iteration of the inner loop.
The execution of Algorithm sbmm BLK can be adapted to the underlying
architecture and problem by carefully choosing parameters c and b . Parameter
c defines the number of columns of C computed at a given iteration of the outer
loop, while b corresponds to the number of columns of A that are accessed at
each iteration of the inner loop. The optimal values for c and b strongly depends
on the memory organisation of the target architecture. For example, in current
multicore processors, b = 32 is usually a convenient choice, while for GPUs larger
values are recommended (e.g. b = 128).
×
4 Implementations
We next present three routines to compute the symmetric band matrix-matrix
product on a CPU-GPU system. All the implementations intensively invoke ker-
nels from CUBLAS.
 
Search WWH ::




Custom Search