// Compute contribution from third row.
lLoad = vload8(0, in + (offset + width * 2 + 0));
mLoad = vload8(0, in + (offset + width * 2 + 1));
rLoad = vload8(0, in + (offset + width * 2 + 2));

lData = convert_short8(lLoad);
mData = convert_short8(mLoad);
rData = convert_short8(rLoad);

_dx1 += rData - lData;
_dy1 -= rData + lData + mData * (short8)2;
_dx2 += (rData - lData) * (short8)2;
_dx3  = rData - lData;
_dy3  = rData + lData + mData * (short8)2;

// Store the results.
vstore8(convert_char8(_dx1 >> 3), 0, dx1 + offset + width + 1);
vstore8(convert_char8(_dy1 >> 3), 0, dy1 + offset + width + 1);
vstore8(convert_char8(_dx2 >> 3), 0, dx2 + offset + width * 2 + 1);
vstore8(convert_char8(_dy2 >> 3), 0, dy2 + offset + width * 2 + 1);
vstore8(convert_char8(_dx3 >> 3), 0, dx3 + offset + width * 3 + 1);
vstore8(convert_char8(_dy3 >> 3), 0, dy3 + offset + width * 3 + 1);
Listing 7.8. Computing contribution from the third row: 3xchar8.
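For checking the output of a vectorized kernel like this one, a scalar reference can be handy. The following is a minimal sketch in plain C, not the chapter's code: the function names sobel3x3_ref and floor_div8 are hypothetical, the image is assumed to be row-major with 8-bit unsigned pixels, and the sign convention for dy (top row minus bottom row) is an assumption. The division by 8 mirrors the ">> 3" applied before the stores, which keeps every result within the range of a signed 8-bit value.

```c
#include <stddef.h>
#include <stdint.h>

/* Floor division by 8. This matches an arithmetic right shift by 3
 * without relying on implementation-defined shifts of negative ints. */
static int floor_div8(int v)
{
    return (v >= 0) ? v / 8 : -((-v + 7) / 8);
}

/* Scalar 3x3 Sobel reference (hypothetical helper, not from the chapter).
 * Writes dx and dy for interior pixels only; border pixels are untouched. */
void sobel3x3_ref(const uint8_t *in, int8_t *dx, int8_t *dy,
                  size_t width, size_t height)
{
    for (size_t y = 1; y + 1 < height; ++y) {
        for (size_t x = 1; x + 1 < width; ++x) {
            /* Pointers to the left pixel of each of the three input rows. */
            const uint8_t *r0 = in + (y - 1) * width + (x - 1);
            const uint8_t *r1 = r0 + width;
            const uint8_t *r2 = r1 + width;

            /* Horizontal gradient: right column minus left column,
             * with row weights 1, 2, 1. */
            int gx = (r0[2] - r0[0]) + 2 * (r1[2] - r1[0]) + (r2[2] - r2[0]);

            /* Vertical gradient: top row minus bottom row (assumed sign),
             * with column weights 1, 2, 1. */
            int gy = (r0[0] + 2 * r0[1] + r0[2]) - (r2[0] + 2 * r2[1] + r2[2]);

            dx[y * width + x] = (int8_t)floor_div8(gx);
            dy[y * width + x] = (int8_t)floor_div8(gy);
        }
    }
}
```

Comparing the buffers written by the OpenCL kernel against this reference on a small test image is one way to catch sign or offset mistakes introduced while vectorizing.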
7.5 Optimizing the General Matrix Multiplication
The Sobel filter implementations have highlighted the importance of using vector
instructions and a high number of active work-items. We next study implementations
of the general matrix multiplication (GEMM) to elucidate the importance
of using caches effectively. We first discuss aspects of the caches and how we
optimize for them. At the end, we look at the runtimes on an Arndale development
board and compare them with our discussion.
7.5.1 Algorithm
The general matrix multiplication is a function of the Basic Linear Algebra
Subprograms (BLAS) API¹¹ that computes
C = αAB + βC,
where A, B, and C are matrices of floating-point numbers and α and β are scalars.
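As a point of reference, the formula can be written as a naive triple loop. The sketch below is plain C, not an optimized or official BLAS routine: the name sgemm_ref is hypothetical, and it assumes square n-by-n matrices of single-precision floats stored in row-major order.

```c
#include <stddef.h>

/* Naive reference for C = alpha * A * B + beta * C (hypothetical helper).
 * Assumes square n x n matrices stored in row-major order. */
void sgemm_ref(size_t n, float alpha, const float *A, const float *B,
               float beta, float *C)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            /* Dot product of row i of A with column j of B. */
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k) {
                acc += A[i * n + k] * B[k * n + j];
            }
            /* Scale the product and add the scaled old value of C. */
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

Note that the inner loop walks B with a stride of n elements, so consecutive iterations touch different rows of B. That access pattern is the kind of behavior that makes cache usage matter for GEMM.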
7.5.2 Implementation Details
In our implementation, the matrices are N × N arrays of single-precision
floating-point numbers (SGEMM). We consider two common SGEMM variants:
11 http://www.netlib.org/blas