Information Technology Reference
In-Depth Information
Algorithm:
[
C
]:=sbmm
BLK outer
(
C,A,B,k
)
Partition
C ₒ
C
L
C
R
,Bₒ
B
L
B
R
where
C
L
,B
L
have 0 columns
while
n
(
C
L
)
<n
(
C
)
do
Determine block size
c
Repartition
C
L
C
R
ₒ
C
0
C
1
C
2
,
B
L
B
R
ₒ
B
0
B
1
B
2
where
C
1
,B
1
have
c
columns
C
1
:= sbmm
BLK inner
(
C
1
,A,B
1
,k
)
Continue with
C
L
C
R
ₐ
C
0
C
1
C
2
,
B
L
B
R
ₐ
B
0
B
1
B
2
endwhile
Fig. 2.
Outer loop of Algorithm sbmm
BLK
that computes
C
:=
A · B
+
C
storage format. An analogous procedure that accesses the elements in the upper
triangle or all the elements of
A
is straight-forward.
The algorithm consists of two loops. The outer loop (Figure 2) partitions
matrices
B
and
C
into blocks of
c
columns; at every iteration of the loop, the
elements in the active column-block of
C
are updated. The inner loop (Figure 3)
proceeds along the main diagonal of
A
(top-left to down-right), updating the
corresponding elements of
C
. Matrices
B
and
C
are partitioned row-wise, while
matrix
A
requires a 3
3 partition. At each iteration, blocks
A
i
1
and
B
i
with
i
=1
,
2
,
3, are accessed; while block
C
i
where
i
=1
,
2
,
3, is updated. Note that
A
11
and
A
31
are respectively, lower and upper triangular. Figure 4 details the
blocks accessed and updated at a given iteration of the inner loop.
The execution of Algorithm sbmm
BLK
can be adapted to the underlying
architecture and problem by carefully choosing parameters
c
and
b
. Parameter
c
defines the number of columns of
C
computed at a given iteration of the outer
loop, while
b
corresponds to the number of columns of
A
that are accessed at
each iteration of the inner loop. The optimal values for
c
and
b
strongly depends
on the memory organisation of the target architecture. For example, in current
multicore processors,
b
= 32 is usually a convenient choice, while for GPUs larger
values are recommended (e.g.
b
= 128).
×
4 Implementations
We next present three routines to compute the symmetric band matrix-matrix
product on a CPU-GPU system. All the implementations intensively invoke ker-
nels from CUBLAS.