Efficient Symmetric Band Matrix-Matrix Multiplication on GPUs - High Performance Computing - page 5

Information Technology Reference

In-Depth Information

Algorithm: [ E ]:=sbmm BLK inner ( E,A,D,k )

⊛

⊝

⊞

⊠ ,Aₒ

⊛

⊝

⊞

⊠ ,Dₒ

⊛

⊝

⊞

⊠

E T

E M

E B

A TL

A ML A MR

A BR

D T

D M

D B

Partition E ₒ

where E T ,D T have 0 elements; A TL is 0 × 0and E M , A ML have k rows

while m ( E T ) <m ( E ) do

Determine block size b

Repartition

⊛

⊝

⊞

⊠

⊛

⊝

⊞

⊠

⊛

⊝

⊞

⊠

E 0

E 1

E 2

E 3

E 4

A 00

A 10

D D D D D 4

⊛

⊝

⊞

⊠ ₒ

⊛

⊝

⊞

⊠ ₒ

⊛

⊝

⊞

⊠ ₒ

E T

E M

E B

A TL

A ML A MR

A BR

D T

D M

D B

A 11

,

A 20

A 21 A 22

A 31 A 32

A 42

,

where

D 1 has b rows;

E 1 has b rows;

E 3 has 0 rows if m ( D 0 ) > ( n ( A ) − k − 1) and has b rows otherwise;

A 11

is b × b ;

A 31

is empty if m ( D 0 ) > ( n ( A )

−

k

−

1) and is b

×

b otherwise

E 1 := E 1 + A 11 ·

D 1

E 1 := E 1 + A T

21 · D 2

E 1 := E 1 + A T

31 · D 3

E 2 := E 2 + A 21 · D 1

E 3 := E 3 + A 31 · D 1

Continue with

⊛

⊝

⊞

⊠

⊛

⊝

⊞

⊠

⊛

⊝

⊞

⊠

E 0

E 1

E 2

E 3

E 4

A 00

A 10 A 11

A 20 A 21

D D D D D 4

⊛

⊝

⊞

⊠ ₐ

⊛

⊝

⊞

⊠ ₐ

⊛

⊝

⊞

⊠ ₐ

E T

E M

E B

A TL

A ML A MR

A BR

D T

D M

D B

,

A 22

,

A 31

A 32

A 42

endwhile

Fig. 3. Inner loop of Algorithm sbmm BLK that computes C := A · B + C

4.1

Implementation sbmm blk

Routine sbmm blk is an implementation of Algorithm sbmm BLK with all the

computations performed via the appropriate CUBLAS kernel. Initially, all the

matrices are sent totheGPU.Then, ʲC is computed via the corresponding

CUBLAS routine. This is not a BLAS-3 operation, but it presents a relatively

small computational cost and can be eciently computed on the accelerator.

Next, the operations in sbmm BLK are executed; and finally C is transferred

back to the CPU. The update of C 1 requires three matrix-matrix products: one

involving a symmetric matrix (block A 11 ), a second with an upper triangular

matrix (block A 31 ); and the last one with two general matrices. CUBLAS pro-

vides specific routines for all these operations. Blocks C 2 and C 3 are updated via

a product of two general matrices and a product of an upper triangular matrix

times a general matrix, respectively.

The use of CUBLAS routines also presents a drawback, as the kernel that

implements the product of a triangular matrix times a general matrix (routine

trmm) indeed computes

C = ʱop ( A ) B,

(3)

Next Page

High Performance Computing

Search WWH ::

Custom Search

Home