Efficient Symmetric Band Matrix-Matrix Multiplication on GPUs - High Performance Computing

6 Concluding Remarks

We have addressed the computation of the symmetric band matrix-matrix
multiplication on CPU-GPU platforms. Exploiting the structure of the matrix
yields significant savings in both memory and computational cost. Two
implementations, sbmm_blk and sbmm_blk+ms, are presented and evaluated. Both
routines leverage the parallelism of the target architecture to deliver high
performance. Routine sbmm_blk adopts the packed storage scheme defined in BLAS
and, consequently, suffers from some drawbacks that limit its performance on
parallel hardware architectures. Routine sbmm_blk+ms partially overcomes these
problems by relying on a modified packed storage scheme that is better suited
to the underlying architecture, at the cost of a minor increase in memory
requirements. The experimental evaluation shows considerable gains for both
routines over the naive implementations of this operation based on the kernels
in MKL and CUBLAS.
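To make the memory and cost savings concrete, the following is a minimal NumPy sketch of the standard BLAS lower band storage and a reference product that touches only the stored diagonals. It is an illustration under stated assumptions, not the routines evaluated in this work: the function names `pack_lower_band` and `sbmm_reference` are hypothetical, the code runs on the CPU, and it uses the plain BLAS layout rather than the modified scheme, which this section does not specify.

```python
import numpy as np

def pack_lower_band(A, k):
    """Pack the lower band of a symmetric matrix A with bandwidth k into
    BLAS-style band storage: AB[i - j, j] = A[i, j] for j <= i <= j + k."""
    n = A.shape[0]
    AB = np.zeros((k + 1, n), dtype=A.dtype)
    for d in range(k + 1):              # d = i - j indexes the sub-diagonal
        AB[d, : n - d] = np.diagonal(A, -d)
    return AB

def sbmm_reference(AB, B):
    """Compute C = A @ B using only the k + 1 stored diagonals of A."""
    k, n = AB.shape[0] - 1, AB.shape[1]
    C = AB[0][:, None] * B              # main diagonal contribution
    for d in range(1, k + 1):
        diag = AB[d, : n - d][:, None]
        C[d:] += diag * B[: n - d]      # entries A[j + d, j] below the diagonal
        C[: n - d] += diag * B[d:]      # mirrored entries A[j, j + d] above it
    return C

# Small random symmetric band matrix (illustrative sizes only).
rng = np.random.default_rng(0)
n, k, m = 8, 2, 3
A = np.zeros((n, n))
for d in range(k + 1):
    v = rng.standard_normal(n - d)
    A += np.diag(v, -d)
    if d:
        A += np.diag(v, d)
B = rng.standard_normal((n, m))

AB = pack_lower_band(A, k)              # (k + 1) * n entries instead of n * n
C = sbmm_reference(AB, B)
assert np.allclose(C, A @ B)
```

The packed array holds (k + 1) n entries instead of n^2, and the product visits 2k + 1 diagonals per column of B instead of n rows, which is the source of the savings mentioned above when k is much smaller than n.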

Additionally, we have developed a symmetric band matrix-vector routine,
sbmv_ms, that exploits the benefits of the modified storage scheme revealed by
sbmm_blk+ms. Specifically, this new routine delivers higher performance than
its counterpart in CUBLAS. Although our solution requires additional effort to
transform the matrix into the modified storage scheme, we believe it may be
useful in methods that perform several symmetric band matrix-vector products
involving the same matrix, e.g., iterative Krylov subspace-based solvers for
symmetric band linear systems.
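The amortization argument can be sketched with a conjugate gradient iteration, in which the matrix is prepared once and then reused in one band matrix-vector product per iteration. This is only an illustrative CPU sketch: `sbmv` and `conjugate_gradient` are hypothetical names, the layout is the plain BLAS lower band storage (the modified scheme is not described in this section), and the test matrix is an assumed diagonally dominant tridiagonal example.

```python
import numpy as np

def sbmv(AB, x):
    """y = A @ x from lower band storage, where AB[d, j] = A[j + d, j]."""
    k, n = AB.shape[0] - 1, AB.shape[1]
    y = AB[0] * x
    for d in range(1, k + 1):
        diag = AB[d, : n - d]
        y[d:] += diag * x[: n - d]      # sub-diagonal entries
        y[: n - d] += diag * x[d:]      # mirrored super-diagonal entries
    return y

def conjugate_gradient(AB, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite band matrix A.
    Returns the solution and the number of band matrix-vector products."""
    x = np.zeros_like(b)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    matvecs = 0
    for _ in range(max_iter):
        Ap = sbmv(AB, p)                # the same stored matrix is reused here
        matvecs += 1
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, matvecs

# Assumed example: tridiagonal matrix with 4 on the diagonal and -1 next to it.
n = 50
AB = np.vstack([np.full(n, 4.0), np.append(np.full(n - 1, -1.0), 0.0)])
b = np.ones(n)
x, matvecs = conjugate_gradient(AB, b)
```

Any one-time cost of converting the matrix to a different layout would be paid before the loop and spread over all `matvecs` products, which is why such a transformation is easier to justify in iterative solvers than in a single product.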

Acknowledgements. Ernesto Dufrechou and Pablo Ezzatti acknowledge support from
Programa de Desarrollo de las Ciencias Básicas, and Agencia Nacional de
Investigación e Innovación, Uruguay. Enrique S. Quintana-Ortí was supported by
project TIN2011-23283 of the Ministry of Science and Competitiveness (MINECO)
and EU FEDER, and project P1-1B2013-20 of the Fundació Caixa Castelló-Bancaixa
and UJI.

