Table 2. Execution time (in seconds) for the symmetric band matrix-vector product
routines. The CPU-GPU communication times are not reported in this table. The
bandwidth is expressed as a percentage of the matrix dimension.

Matrix Dimension   Bandwidth   sbmv mkl   sbmv cublas   sbmv ms
12800              0.5%        0.002      0.001         0.002
12800              1.0%        0.004      0.002         0.003
12800              2.0%        0.009      0.004         0.003
25600              0.5%        0.006      0.005         0.005
25600              1.0%        0.012      0.008         0.006
25600              2.0%        0.019      0.015         0.007
38400              0.5%        0.013      0.010         0.009
38400              1.0%        0.023      0.019         0.011
38400              2.0%        0.039      0.031         0.012
51200              0.5%        0.023      0.018         0.012
51200              1.0%        0.055      0.030         0.013
51200              2.0%        0.071      0.050         0.018
64000              0.5%        0.035      0.027         0.017
64000              1.0%        0.056      0.045         0.018
64000              2.0%        0.103      0.081         0.027
blocked variants, the cases with n = 10 and n = 20 present a similar execution
time. On the contrary, as could be expected, sbmm mkl requires 2× more time,
while sbmm cublas and sbmm ms require approximately between 1.75× and 2× more
time. This is because in the GPU-based variants, although the computing time is
doubled, the data transfer time remains similar. Thus, the total time is
increased by a factor smaller than, but close to, 2×.
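To make the reasoning concrete, the sketch below (our own illustration, not the
paper's sbmm implementation) computes C = A·B as a loop of n band matrix-vector
products with the stock cublasSsbmv kernel, on data already resident on the GPU.
Doubling n doubles the loop, i.e. the computing time, while the one-time transfer
of A is unaffected; the function name and argument layout are assumptions.

#include <cublas_v2.h>

/* Minimal sketch: C = A*B with a symmetric band matrix A, realized as n
 * band matrix-vector products. d_A holds the band-stored upper triangle
 * of A (lda = k + 1 rows), d_B and d_C are dim-by-n column-major; all
 * arrays are assumed to be resident on the device already. */
static void sbmm_by_columns(cublasHandle_t handle, int dim, int k, int n,
                            const float *d_A, int lda,
                            const float *d_B, float *d_C)
{
    const float one = 1.0f, zero = 0.0f;
    for (int j = 0; j < n; ++j) {
        /* c_j = A * b_j: one band matrix-vector product per column of B */
        cublasSsbmv(handle, CUBLAS_FILL_MODE_UPPER, dim, k,
                    &one, d_A, lda,
                    d_B + (size_t)j * dim, 1,
                    &zero, d_C + (size_t)j * dim, 1);
    }
}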
As stated above, the sbmv ms routine is more efficient than the corresponding
kernel from CUBLAS. However, the gains reported do not compensate for the
overhead introduced by the higher volume of data transfer and the mandatory
transformation of A to the modified storage scheme. There are some applications
where several matrix-vector products have to be computed using the same matrix.
This is the case of iterative solvers for systems of linear equations, such as
the Conjugate-Gradient method. In such applications, the matrix can be
transformed and transferred to the device once, and then successively re-used
at each iteration of the algorithm. Thus, the overhead introduced by the data
transfers can easily be compensated for after several iterations if the
matrix-vector routine is more efficient.
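As a minimal sketch of this pattern, assuming the band matrix has already been
converted and copied to the device, the Conjugate-Gradient loop below re-uses it
through one band matrix-vector product per iteration. It relies on standard
cuBLAS kernels (cublasSsbmv plus level-1 routines), omits error checking, and
the paper's sbmv ms kernel could replace the call marked below.

#include <math.h>
#include <cublas_v2.h>

/* Sketch: Conjugate Gradient for A*x = b with A band-stored on the device.
 * A is transferred (and, for a routine like sbmv ms, converted) once; every
 * iteration then re-uses it. d_r, d_p, d_Ap are device workspaces. */
static void cg_band(cublasHandle_t h, int n, int k,
                    const float *d_A, int lda,
                    const float *d_b, float *d_x,
                    float *d_r, float *d_p, float *d_Ap,
                    int maxit, float tol)
{
    const float one = 1.0f, zero = 0.0f, neg1 = -1.0f;
    float rs_old, rs_new, pAp, alpha, beta;

    /* r = b - A*x ; p = r */
    cublasScopy(h, n, d_b, 1, d_r, 1);
    cublasSsbmv(h, CUBLAS_FILL_MODE_UPPER, n, k, &neg1, d_A, lda,
                d_x, 1, &one, d_r, 1);
    cublasScopy(h, n, d_r, 1, d_p, 1);
    cublasSdot(h, n, d_r, 1, d_r, 1, &rs_old);

    for (int it = 0; it < maxit && sqrtf(rs_old) > tol; ++it) {
        /* Ap = A*p: the band matrix-vector product re-used at each step
         * (the sbmv ms kernel could be substituted here) */
        cublasSsbmv(h, CUBLAS_FILL_MODE_UPPER, n, k, &one, d_A, lda,
                    d_p, 1, &zero, d_Ap, 1);
        cublasSdot(h, n, d_p, 1, d_Ap, 1, &pAp);
        alpha = rs_old / pAp;
        cublasSaxpy(h, n, &alpha, d_p, 1, d_x, 1);   /* x += alpha*p  */
        alpha = -alpha;
        cublasSaxpy(h, n, &alpha, d_Ap, 1, d_r, 1);  /* r -= alpha*Ap */
        cublasSdot(h, n, d_r, 1, d_r, 1, &rs_new);
        beta = rs_new / rs_old;
        cublasSscal(h, n, &beta, d_p, 1);            /* p = r + beta*p */
        cublasSaxpy(h, n, &one, d_r, 1, d_p, 1);
        rs_old = rs_new;
    }
}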
Table 2 shows the execution time required by the matrix-vector implementations
without taking into account the time dedicated to the data transfers. The
sbmv mkl kernel is outperformed by both GPU-based routines, and the sbmv ms
variant in particular obtains remarkable speed-ups. These results show that the
speed-up is higher for larger matrices. In this experimental evaluation, sbmv ms
reports an acceleration factor of up to 4× when compared with its MKL
counterpart, and up to 3× when compared with the CUBLAS routine.
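For reference, the CPU baseline in Table 2 boils down to a single CBLAS call;
the sketch below (the helper name and the bandwidth-percentage parameterization
are our assumptions, chosen to mirror the table) shows how a half-bandwidth
given as a percentage of the dimension maps onto the k argument of MKL's ssbmv.

#include <mkl.h>

/* Sketch of one sbmv mkl case: y = A*x with the half-bandwidth k taken as a
 * percentage of the dimension, as in Table 2 (e.g. n = 51200 and 0.5% give
 * k = 256). A is stored in the standard BLAS symmetric band layout. */
static void run_sbmv_mkl(MKL_INT n, double band_pct,
                         const float *A, const float *x, float *y)
{
    MKL_INT k   = (MKL_INT)(band_pct * n / 100.0);  /* half-bandwidth    */
    MKL_INT lda = k + 1;                            /* band-storage rows */
    cblas_ssbmv(CblasColMajor, CblasUpper, n, k,
                1.0f, A, lda, x, 1, 0.0f, y, 1);
}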
 