Efficient Symmetric Band Matrix-Matrix Multiplication on GPUs - High Performance Computing

Information Technology Reference

In-Depth Information

A11

A21

A31

Padding

Fig. 5. Modified storage scheme for symmetric band matrices and how it is accessed

in the inner loop from sbmm BLK

Figure 5 shows the modified storage scheme and how it is accessed during an

inner loop iteration in sbmm BLK . The strictly lower triangular part of A 31 is

now conveniently placed in the added rows. Consequently, blocks A 21 and A 31

can be merged, and so can the operations they are involved in. Thus, the updates

performed at each step of the inner loop can be reformulated as:

E 1 := E 1 + A 11 ·

D 1 ,

E 1 := E 1 + A 21 A 31 ·

D 1

E 2

E 3

:= E 2

E 3

+ A 21

A 31

D 2

D 3

This approach presents two main advantages:

- The number of invocations to CUBLAS kernels is reduced from 8 to 3 per

step and, consequently, the overhead introduced by the kernels invocations

is also reduced.

- It eliminates the invocations to kernels with a moderate to low cost, which

can not exploit the massively parallel architecture of the GPU. Concretely,

the operations that dissapear involve triangular matrices and present load-

balancing problems.

There are also some drawbacks related to this implementation. First of all, the

memory requirements are enlarged. In addition, the number of arithmetic opera-

tions is also increased, as it operates with the null elements in A 11 and A 31 .

4.3

Implementation sbmv ms

Additionally, we implemented a symmetric banded matrix-vector product vari-

ant based on the modified storage scheme. Due to the different storage scheme,

this variant may be slightly faster than the implementation from CUBLAS.The

overhead introduced by transforming A to the modified storage can be relatively

high, in principle higher than the gain that sbmv ms introduces with respect to

Search WWH ::

Custom Search

Home