Information Technology Reference
In-Depth Information
Algorithm:
[
E
]:=sbmm
BLK inner
(
E,A,D,k
)
⊛
⊝
⊞
⊠
,Aₒ
⊛
⊝
⊞
⊠
,Dₒ
⊛
⊝
⊞
⊠
E
T
E
M
E
B
A
TL
A
ML
A
MR
A
BR
D
T
D
M
D
B
Partition
E ₒ
where
E
T
,D
T
have 0 elements;
A
TL
is 0
×
0and
E
M
,
A
ML
have
k
rows
while
m
(
E
T
)
<m
(
E
)
do
Determine block size
b
Repartition
⊛
⊝
⊞
⊠
⊛
⊝
⊞
⊠
⊛
⊝
⊞
⊠
E
0
E
1
E
2
E
3
E
4
A
00
A
10
D
D
D
D
D
4
⊛
⊝
⊞
⊠
ₒ
⊛
⊝
⊞
⊠
ₒ
⊛
⊝
⊞
⊠
ₒ
E
T
E
M
E
B
A
TL
A
ML
A
MR
A
BR
D
T
D
M
D
B
A
11
,
A
20
A
21
A
22
A
31
A
32
A
42
,
where
D
1
has
b
rows;
E
1
has
b
rows;
E
3
has 0 rows if
m
(
D
0
)
>
(
n
(
A
)
− k −
1) and has
b
rows otherwise;
A
11
is
b × b
;
A
31
is empty if
m
(
D
0
)
>
(
n
(
A
)
−
k
−
1) and is
b
×
b
otherwise
E
1
:=
E
1
+
A
11
·
D
1
E
1
:=
E
1
+
A
T
21
· D
2
E
1
:=
E
1
+
A
T
31
· D
3
E
2
:=
E
2
+
A
21
· D
1
E
3
:=
E
3
+
A
31
· D
1
Continue with
⊛
⊝
⊞
⊠
⊛
⊝
⊞
⊠
⊛
⊝
⊞
⊠
E
0
E
1
E
2
E
3
E
4
A
00
A
10
A
11
A
20
A
21
D
D
D
D
D
4
⊛
⊝
⊞
⊠
ₐ
⊛
⊝
⊞
⊠
ₐ
⊛
⊝
⊞
⊠
ₐ
E
T
E
M
E
B
A
TL
A
ML
A
MR
A
BR
D
T
D
M
D
B
,
A
22
,
A
31
A
32
A
42
endwhile
Fig. 3.
Inner loop of Algorithm sbmm
BLK
that computes
C
:=
A · B
+
C
4.1
Implementation
sbmm
blk
Routine sbmm
blk
is an implementation of Algorithm sbmm
BLK
with all the
computations performed via the appropriate CUBLAS kernel. Initially, all the
matrices are sent totheGPU.Then,
ʲC
is computed via the corresponding
CUBLAS routine. This is not a BLAS-3 operation, but it presents a relatively
small computational cost and can be eciently computed on the accelerator.
Next, the operations in sbmm
BLK
are executed; and finally
C
is transferred
back to the CPU. The update of
C
1
requires three matrix-matrix products: one
involving a symmetric matrix (block
A
11
), a second with an upper triangular
matrix (block
A
31
); and the last one with two general matrices. CUBLAS pro-
vides specific routines for all these operations. Blocks
C
2
and
C
3
are updated via
a product of two general matrices and a product of an upper triangular matrix
times a general matrix, respectively.
The use of CUBLAS routines also presents a drawback, as the kernel that
implements the product of a triangular matrix times a general matrix (routine
trmm) indeed computes
C
=
ʱop
(
A
)
B,
(3)