The main contribution of this paper is the introduction and evaluation of
two new GPU-based routines for the symmetric band matrix multiplication
(sbmm) that leverage the vast hardware parallelism of GPUs and the high data-
parallelism of this operation to significantly accelerate its computation. In par-
ticular, the experimental results for the accelerator-enabled codes collected on
an NVIDIA C2050 GPU and two Intel E5520 quad-core processors, demonstrate superior performance and scalability over a multithreaded CPU variant using the
Intel MKL (Math Kernel Library) and a GPU version based on CUBLAS.
The rest of the paper is structured as follows. In Section 2, we describe the
routines related to band matrix multiplication available in BLAS. Then, in Sec-
tion 3, we introduce a modified algorithm for the operation where most of the
computations are cast in terms of BLAS-3 kernels. We describe the two new
GPU-based versions that implement the aforementioned algorithm in Section 4.
We present experimental results in Section 5 and, finally, we discuss some con-
clusions and future work in Section 6.
2 The Operation in BLAS
The BLAS specification provides support to perform computations with sym-
metric band matrices. In particular, it includes a specific kernel called sbmv to
perform a matrix-vector product where the matrix is symmetric and banded.
In contrast, BLAS does not offer the equivalent kernel for the matrix-matrix
product operation when one of the matrices presents a band structure. This
matrix-matrix product can be easily implemented on top of the sbmv routine. Concretely, we can partition matrix B columnwise and perform a matrix-vector product with each column of B. This procedure can be written as:
C_i = α A B_i + β C_i,    (2)

where C_i and B_i denote the i-th columns of C and B, respectively, and 1 ≤ i ≤ n.
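For concreteness, the following is a minimal sketch of this column-by-column approach through the CBLAS interface. The routine name sbmm_via_sbmv, its argument list, and the choice of lower band storage are illustrative assumptions, not part of the paper; in the paper's notation B and C have n columns, so m = n.

```c
/* Minimal sketch: sbmm built on the level-2 kernel DSBMV, one call per
 * column of B, as in Eq. (2): C_i = alpha*A*B_i + beta*C_i.
 * A is symmetric banded (order n, k sub-diagonals) in standard BLAS
 * lower band storage; B and C are n x m, column-major.
 * Routine name and argument list are illustrative, not the paper's. */
#include <cblas.h>
#include <stddef.h>

void sbmm_via_sbmv(int n, int m, int k,
                   double alpha, const double *A, int lda,
                   const double *B, int ldb,
                   double beta, double *C, int ldc)
{
    for (int i = 0; i < m; ++i)   /* one matrix-vector product per column */
        cblas_dsbmv(CblasColMajor, CblasLower, n, k, alpha, A, lda,
                    &B[(size_t)i * ldb], 1, beta, &C[(size_t)i * ldc], 1);
}
```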
Although this simple approach allows us to use BLAS kernels to execute the complete operation, it is based on a level-2 BLAS routine, while the product of matrices is a level-3 BLAS operation. Thus, with this implementation the underlying architecture cannot be efficiently exploited. In particular, each element of matrix A is accessed n times, resulting in a suboptimal usage of the memory hierarchy.
3 Algorithm sbmm_blk
Block-wise algorithms for linear algebra computations can efficiently leverage the memory hierarchy of current parallel computing architectures, resulting in higher performance. In this context, we propose a blocked algorithm to perform the matrix-matrix product when matrix A presents a symmetric band structure, which is presented in Figures 2 and 3. Note that it only accesses the elements in the lower triangular part of matrix A, in order to support the use of the packed storage scheme.
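The figures with the actual blocked routine are not reproduced here, but the idea of casting the band product in terms of BLAS-3 kernels while touching only the lower triangle of A can be sketched as follows. For simplicity, this sketch assumes A is kept in a full n x n column-major array with explicit zeros outside the band (the packed band layout would only change the indexing); the routine name, argument list, and block-loop structure are our assumptions, not the paper's Figure 2/3 algorithm.

```c
/* Sketch of a blocked C = alpha*A*B + beta*C with A symmetric banded
 * (order n, bandwidth k, only the lower triangle referenced).  All
 * computation is cast as BLAS-3 calls: DSYMM on diagonal blocks, DGEMM
 * on sub-diagonal blocks.  A is a full column-major array with explicit
 * zeros outside the band; an illustration only, not the paper's routine. */
#include <cblas.h>
#include <stddef.h>

void sbmm_blk_sketch(int n, int m, int k, int b,
                     double alpha, const double *A, int lda,
                     const double *B, int ldb,
                     double beta, double *C, int ldc)
{
    /* Scale C once so that every block update below can use beta = 1. */
    for (int j = 0; j < m; ++j)
        for (int i = 0; i < n; ++i)
            C[i + (size_t)j * ldc] = (beta == 0.0)
                ? 0.0 : beta * C[i + (size_t)j * ldc];

    for (int jb = 0; jb < n; jb += b) {          /* block column of A   */
        int nb = (jb + b <= n) ? b : n - jb;     /* width of this block */

        /* Diagonal block: symmetric, handled with a single SYMM. */
        cblas_dsymm(CblasColMajor, CblasLeft, CblasLower, nb, m,
                    alpha, &A[jb + (size_t)jb * lda], lda,
                    &B[jb], ldb, 1.0, &C[jb], ldc);

        /* Sub-diagonal blocks intersecting the band: by symmetry of A,
         * each contributes once as stored and once transposed. */
        for (int ib = jb + b; ib < n && ib < jb + nb + k; ib += b) {
            int mb = (ib + b <= n) ? b : n - ib;
            /* C(ib,:) += alpha * A(ib,jb) * B(jb,:) */
            cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                        mb, m, nb, alpha, &A[ib + (size_t)jb * lda], lda,
                        &B[jb], ldb, 1.0, &C[ib], ldc);
            /* C(jb,:) += alpha * A(ib,jb)^T * B(ib,:) */
            cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                        nb, m, mb, alpha, &A[ib + (size_t)jb * lda], lda,
                        &B[ib], ldb, 1.0, &C[jb], ldc);
        }
    }
}
```

Because each sub-diagonal block is applied both directly and transposed, the strictly upper triangle of A is never referenced, matching the access pattern described above.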