Table 2. Execution time (in seconds) for the symmetric band matrix-vector product
routines. The CPU-GPU communication times are not reported in this table. The
bandwidth is expressed as a percentage of the matrix dimension.

Matrix Dimension   Bandwidth   sbmv mkl   sbmv cublas   sbmv ms
12800              0.5%        0.002      0.001         0.002
12800              1.0%        0.004      0.002         0.003
12800              2.0%        0.009      0.004         0.003
25600              0.5%        0.006      0.005         0.005
25600              1.0%        0.012      0.008         0.006
25600              2.0%        0.019      0.015         0.007
38400              0.5%        0.013      0.010         0.009
38400              1.0%        0.023      0.019         0.011
38400              2.0%        0.039      0.031         0.012
51200              0.5%        0.023      0.018         0.012
51200              1.0%        0.055      0.030         0.013
51200              2.0%        0.071      0.050         0.018
64000              0.5%        0.035      0.027         0.017
64000              1.0%        0.056      0.045         0.018
64000              2.0%        0.103      0.081         0.027
blocked variants, the cases with n = 10 and n = 20 present a similar execution
time. On the contrary, as could be expected, sbmm mkl requires 2× more time,
while sbmm cublas and sbmm ms require approximately between 1.75× and 2× more
time. This is because in the GPU-based variants, although the computing time is
doubled, the data transfer time remains similar. Thus, the total time is
increased by a factor smaller than, but close to, 2×.
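To make the reasoning concrete, the sketch below (our own illustration, not the
paper's sbmm implementation) computes C = A·B as a loop of n band matrix-vector
products with the stock cublasSsbmv kernel, on data already resident on the GPU.
Doubling n doubles the loop, i.e. the computing time, while the one-time transfer
of A is unaffected; the function name and argument layout are assumptions.

#include <cublas_v2.h>

/* Minimal sketch: C = A*B with a symmetric band matrix A, realized as n
 * band matrix-vector products. d_A holds the band-stored upper triangle
 * of A (lda = k + 1 rows), d_B and d_C are dim-by-n column-major; all
 * arrays are assumed to be resident on the device already. */
static void sbmm_by_columns(cublasHandle_t handle, int dim, int k, int n,
                            const float *d_A, int lda,
                            const float *d_B, float *d_C)
{
    const float one = 1.0f, zero = 0.0f;
    for (int j = 0; j < n; ++j) {
        /* c_j = A * b_j: one band matrix-vector product per column of B */
        cublasSsbmv(handle, CUBLAS_FILL_MODE_UPPER, dim, k,
                    &one, d_A, lda,
                    d_B + (size_t)j * dim, 1,
                    &zero, d_C + (size_t)j * dim, 1);
    }
}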
As stated above, the sbmv ms routine is more efficient than the corresponding
kernel from CUBLAS. However, the gains reported do not compensate for the
overhead introduced by the higher volume of data transfer and the mandatory
transformation of A to the modified storage scheme. There are some applications
where several matrix-vector products have to be computed using the same matrix.
This is the case of iterative solvers for systems of linear equations, such as
the Conjugate-Gradient method. In such applications, the matrix can be
transformed and transferred to the device once, and then successively re-used
at each iteration of the algorithm. Thus, the overhead introduced by the data
transfers can easily be compensated for after several iterations if the
matrix-vector routine is more efficient.
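As a minimal sketch of this pattern, assuming the band matrix has already been
converted and copied to the device, the Conjugate-Gradient loop below re-uses it
through one band matrix-vector product per iteration. It relies on standard
cuBLAS kernels (cublasSsbmv plus level-1 routines), omits error checking, and
the paper's sbmv ms kernel could replace the call marked below.

#include <math.h>
#include <cublas_v2.h>

/* Sketch: Conjugate Gradient for A*x = b with A band-stored on the device.
 * A is transferred (and, for a routine like sbmv ms, converted) once; every
 * iteration then re-uses it. d_r, d_p, d_Ap are device workspaces. */
static void cg_band(cublasHandle_t h, int n, int k,
                    const float *d_A, int lda,
                    const float *d_b, float *d_x,
                    float *d_r, float *d_p, float *d_Ap,
                    int maxit, float tol)
{
    const float one = 1.0f, zero = 0.0f, neg1 = -1.0f;
    float rs_old, rs_new, pAp, alpha, beta;

    /* r = b - A*x ; p = r */
    cublasScopy(h, n, d_b, 1, d_r, 1);
    cublasSsbmv(h, CUBLAS_FILL_MODE_UPPER, n, k, &neg1, d_A, lda,
                d_x, 1, &one, d_r, 1);
    cublasScopy(h, n, d_r, 1, d_p, 1);
    cublasSdot(h, n, d_r, 1, d_r, 1, &rs_old);

    for (int it = 0; it < maxit && sqrtf(rs_old) > tol; ++it) {
        /* Ap = A*p: the band matrix-vector product re-used at each step
         * (the sbmv ms kernel could be substituted here) */
        cublasSsbmv(h, CUBLAS_FILL_MODE_UPPER, n, k, &one, d_A, lda,
                    d_p, 1, &zero, d_Ap, 1);
        cublasSdot(h, n, d_p, 1, d_Ap, 1, &pAp);
        alpha = rs_old / pAp;
        cublasSaxpy(h, n, &alpha, d_p, 1, d_x, 1);   /* x += alpha*p  */
        alpha = -alpha;
        cublasSaxpy(h, n, &alpha, d_Ap, 1, d_r, 1);  /* r -= alpha*Ap */
        cublasSdot(h, n, d_r, 1, d_r, 1, &rs_new);
        beta = rs_new / rs_old;
        cublasSscal(h, n, &beta, d_p, 1);            /* p = r + beta*p */
        cublasSaxpy(h, n, &one, d_r, 1, d_p, 1);
        rs_old = rs_new;
    }
}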
Table 2 shows the execution time required by the matrix-vector implementations
without taking into account the time dedicated to the data transfers. The
sbmv mkl kernel is outperformed by both GPU-based routines, and the sbmv ms
variant in particular obtains remarkable speed-ups. These results show that the
speed-up is higher for larger matrices. In this experimental evaluation, sbmv ms
reports an acceleration factor of up to 4× when compared with its MKL
counterpart, and up to 3× when compared with the CUBLAS routine.
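For reference, the CPU baseline in Table 2 boils down to a single CBLAS call;
the sketch below (the helper name and the bandwidth-percentage parameterization
are our assumptions, chosen to mirror the table) shows how a half-bandwidth
given as a percentage of the dimension maps onto the k argument of MKL's ssbmv.

#include <mkl.h>

/* Sketch of one sbmv mkl case: y = A*x with the half-bandwidth k taken as a
 * percentage of the dimension, as in Table 2 (e.g. n = 51200 and 0.5% give
 * k = 256). A is stored in the standard BLAS symmetric band layout. */
static void run_sbmv_mkl(MKL_INT n, double band_pct,
                         const float *A, const float *x, float *y)
{
    MKL_INT k   = (MKL_INT)(band_pct * n / 100.0);  /* half-bandwidth    */
    MKL_INT lda = k + 1;                            /* band-storage rows */
    cblas_ssbmv(CblasColMajor, CblasUpper, n, k,
                1.0f, A, lda, x, 1, 0.0f, y, 1);
}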
 