the CUBLAS routine. However, some applications involve the same matrix in
several matrix-vector products. In those applications, it may pay off to
accelerate the matrix-vector products (even slightly) at the price of an
increase in the storage space required to keep the matrix. This is, for
example, the case for iterative solvers such as the Conjugate Gradient method.
5 Experimental Evaluation
We evaluate the performance of the band matrix multiplication implementations
presented in the previous section, namely the sbmm_blk routine built upon the
BLAS packed storage format, and sbmm_blk+ms, which supports the modified stor-
age. Additionally, we include in the evaluation three implementations based on
the band matrix-vector product: two based on the routines provided by MKL
(sbmm_mkl) and CUBLAS (sbmm_cublas), and one more based on routine sbmv_ms
(sbmm_ms).
We computed the products where the dimension of A varies from m = 12,800
to m = 64,000. Three bandwidths were tested for each dimension: k = 0.5, 1, and
2% of m. Matrices B, C featured a reduced number of columns: n = 1, 10, 20.
The first one corresponds in fact to a matrix-vector product.
The evaluation was performed on a platform equipped with an nVIDIA C2050
and two Intel Xeon E5520 quad-core processors. The implementations make
heavy use of routines from the Intel MKL v9.293 and nVIDIA CUBLAS v5.5
libraries. They were compiled with gcc v4.1.2. Finally, all the experiments
were performed using IEEE double-precision real arithmetic.
Table 1 shows the execution time for all the routines evaluated. All the results
include the time required to transfer data between the CPU and GPU memo-
ries. If matrices B and C have a single column, meaning that we are computing
a matrix-vector product, then sbmm_mkl obtains the best performance. The rea-
son is that the computational cost of the operation is too low, and the higher
parallelism of the GPU cannot compensate for the communication time. The
fastest routine on the device is sbmm_cublas when the dimension of A is small
to medium. For larger products, e.g. when m > 38,400 and k ≥ 2%, sbmv_ms
outperforms sbmm_cublas.
As soon as n becomes larger, the blocked variants become more efficient. In
particular, with n = 10, sbmm_blk attains the best execution times. Only the
performance of sbmm_blk+ms is comparable, though always below that of sbmm_blk.
This is due to the extra memory required to store A using the modified storage
scheme and the subsequent overhead when the matrix is transferred to the GPU.
Note that in most of the cases, sbmm_ms outperforms the sbmm_cublas implemen-
tation. Only for the smallest values of m, and when the bandwidth k < 2%,
does the CUBLAS routine outperform sbmm_ms. This is because the matrix-vector
kernel in sbmm_ms is more efficient than its counterpart in CUBLAS, and it can
compensate for the overhead introduced by the extra data transfers required by
the modified storage.
This behaviour is reinforced when n = 20. The blocked variants deliver the
best performances even for the smallest test cases evaluated. Additionally, with the