The main contribution of this paper is the introduction and evaluation of
two new GPU-based routines for the symmetric band matrix multiplication
(sbmm) that leverage the vast hardware parallelism of GPUs and the high data-
parallelism of this operation to significantly accelerate its computation. In par-
ticular, the experimental results for the accelerator-enabled codes collected on
an NVIDIA C2050 GPU and two Intel E5520 quad-core processors, demonstrate superior performance and scalability over a multithreaded CPU variant using the
Intel MKL (Math Kernel Library) and a GPU version based on CUBLAS.
The rest of the paper is structured as follows. In Section 2, we describe the
routines related to band matrix multiplication available in BLAS. Then, in Sec-
tion 3, we introduce a modified algorithm for the operation where most of the
computations are cast in terms of BLAS-3 kernels. We describe the two new
GPU-based versions that implement the aforementioned algorithm in Section 4.
We present experimental results in Section 5 and, finally, we discuss some con-
clusions and future work in Section 6.
2 The Operation in BLAS
The BLAS specification provides support to perform computations with sym-
metric band matrices. In particular, it includes a specific kernel called sbmv to
perform a matrix-vector product where the matrix is symmetric and banded.
In contrast, BLAS does not offer the equivalent kernel for the matrix-matrix
product operation when one of the matrices presents a band structure. This
matrix-matrix product can be easily implemented on top of the sbmv routine. Concretely, we can partition matrix B columnwise and perform a matrix-vector product with each column of B. This procedure can be written as:
C_i = α A B_i + β C_i,    (2)

where C_i and B_i denote the i-th columns of C and B, respectively, and 1 ≤ i ≤ n.
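For concreteness, the following is a minimal sketch of this column-by-column approach through the CBLAS interface. The routine name sbmm_via_sbmv, its argument list, and the choice of lower band storage are illustrative assumptions, not part of the paper; in the paper's notation B and C have n columns, so m = n.

```c
/* Minimal sketch: sbmm built on the level-2 kernel DSBMV, one call per
 * column of B, as in Eq. (2): C_i = alpha*A*B_i + beta*C_i.
 * A is symmetric banded (order n, k sub-diagonals) in standard BLAS
 * lower band storage; B and C are n x m, column-major.
 * Routine name and argument list are illustrative, not the paper's. */
#include <cblas.h>
#include <stddef.h>

void sbmm_via_sbmv(int n, int m, int k,
                   double alpha, const double *A, int lda,
                   const double *B, int ldb,
                   double beta, double *C, int ldc)
{
    for (int i = 0; i < m; ++i)   /* one matrix-vector product per column */
        cblas_dsbmv(CblasColMajor, CblasLower, n, k, alpha, A, lda,
                    &B[(size_t)i * ldb], 1, beta, &C[(size_t)i * ldc], 1);
}
```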
Although this simple approach allows us to use BLAS kernels to execute the complete operation, it is based on a level-2 BLAS routine, while the product of matrices is a level-3 BLAS operation. Thus, with this implementation the underlying architecture cannot be efficiently exploited. In particular, each element of matrix A is accessed n times, resulting in a suboptimal usage of the memory hierarchy.
3 Algorithm sbmm_blk
Block-wise algorithms for linear algebra computations can efficiently leverage the memory hierarchy of current parallel computing architectures, resulting in higher performance. In this context, we propose a blocked algorithm to perform the matrix-matrix product when matrix A presents a symmetric band structure, which is presented in Figures 2 and 3. Note that it only accesses the elements in the lower triangular part of matrix A, in order to support the use of the packed storage scheme.
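The figures with the actual blocked routine are not reproduced here, but the idea of casting the band product in terms of BLAS-3 kernels while touching only the lower triangle of A can be sketched as follows. For simplicity, this sketch assumes A is kept in a full n x n column-major array with explicit zeros outside the band (the packed band layout would only change the indexing); the routine name, argument list, and block-loop structure are our assumptions, not the paper's Figure 2/3 algorithm.

```c
/* Sketch of a blocked C = alpha*A*B + beta*C with A symmetric banded
 * (order n, bandwidth k, only the lower triangle referenced).  All
 * computation is cast as BLAS-3 calls: DSYMM on diagonal blocks, DGEMM
 * on sub-diagonal blocks.  A is a full column-major array with explicit
 * zeros outside the band; an illustration only, not the paper's routine. */
#include <cblas.h>
#include <stddef.h>

void sbmm_blk_sketch(int n, int m, int k, int b,
                     double alpha, const double *A, int lda,
                     const double *B, int ldb,
                     double beta, double *C, int ldc)
{
    /* Scale C once so that every block update below can use beta = 1. */
    for (int j = 0; j < m; ++j)
        for (int i = 0; i < n; ++i)
            C[i + (size_t)j * ldc] = (beta == 0.0)
                ? 0.0 : beta * C[i + (size_t)j * ldc];

    for (int jb = 0; jb < n; jb += b) {          /* block column of A   */
        int nb = (jb + b <= n) ? b : n - jb;     /* width of this block */

        /* Diagonal block: symmetric, handled with a single SYMM. */
        cblas_dsymm(CblasColMajor, CblasLeft, CblasLower, nb, m,
                    alpha, &A[jb + (size_t)jb * lda], lda,
                    &B[jb], ldb, 1.0, &C[jb], ldc);

        /* Sub-diagonal blocks intersecting the band: by symmetry of A,
         * each contributes once as stored and once transposed. */
        for (int ib = jb + b; ib < n && ib < jb + nb + k; ib += b) {
            int mb = (ib + b <= n) ? b : n - ib;
            /* C(ib,:) += alpha * A(ib,jb) * B(jb,:) */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        mb, m, nb, alpha, &A[ib + (size_t)jb * lda], lda,
                        &B[jb], ldb, 1.0, &C[ib], ldc);
            /* C(jb,:) += alpha * A(ib,jb)^T * B(ib,:) */
            cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                        nb, m, mb, alpha, &A[ib + (size_t)jb * lda], lda,
                        &B[ib], ldb, 1.0, &C[jb], ldc);
        }
    }
}
```

Because each sub-diagonal block is applied both directly and transposed, the strictly upper triangle of A is never referenced, matching the access pattern described above.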