    // Compute contribution from third row.
    lLoad = vload8(0, in + (offset + width*2 + 0));
    mLoad = vload8(0, in + (offset + width*2 + 1));
    rLoad = vload8(0, in + (offset + width*2 + 2));

    lData = convert_short8(lLoad);
    mData = convert_short8(mLoad);
    rData = convert_short8(rLoad);

    _dx1 += rData - lData;
    _dy1 -= rData + lData + mData * (short8)2;
    _dx2 += (rData - lData) * (short8)2;
    _dx3  = rData - lData;
    _dy3  = rData + lData + mData * (short8)2;

    // ...

    // Store the results.
    vstore8(convert_char8(_dx1 >> 3), 0, dx1 + offset + width + 1);
    vstore8(convert_char8(_dy1 >> 3), 0, dy1 + offset + width + 1);
    vstore8(convert_char8(_dx2 >> 3), 0, dx2 + offset + width*2 + 1);
    vstore8(convert_char8(_dy2 >> 3), 0, dy2 + offset + width*2 + 1);
    vstore8(convert_char8(_dx3 >> 3), 0, dx3 + offset + width*3 + 1);
    vstore8(convert_char8(_dy3 >> 3), 0, dy3 + offset + width*3 + 1);
Listing 7.8. Computing contribution from the third row: 3xchar8.
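To make the per-row arithmetic easier to follow, here is a scalar C sketch of what the listing does for a single output pixel. It is an illustration, not the chapter's code: the function name sobel_pixel, the unsigned 8-bit input type, and the convention that (x, y) is the top-left corner of the 3x3 window are all assumptions. The row coefficients are inferred from the third-row updates in the listing, and the final shift by 3 mirrors the `>> 3` before the store.

```c
#include <assert.h>
#include <stddef.h>

/* Scalar sketch (hypothetical helper): accumulate the Sobel responses for one
 * output pixel row by row, the way the vector listing does for 8 pixels at once.
 * Assumption: `in` holds unsigned 8-bit pixels, `width` pixels per row. */
void sobel_pixel(const unsigned char *in, size_t width, size_t x, size_t y,
                 signed char *dx, signed char *dy)
{
    assert(in != NULL && dx != NULL && dy != NULL);

    short sum_dx = 0, sum_dy = 0;
    for (size_t row = 0; row < 3; ++row) {
        const unsigned char *p = in + (y + row) * width + x;
        short l = p[0], m = p[1], r = p[2];   /* left, middle, right */

        if (row == 0) {            /* top row of the window */
            sum_dx += r - l;
            sum_dy += r + l + m * 2;
        } else if (row == 1) {     /* middle row: dx only, weighted by 2 */
            sum_dx += (r - l) * 2;
        } else {                   /* bottom row: dy contribution is subtracted */
            sum_dx += r - l;
            sum_dy -= r + l + m * 2;
        }
    }

    /* |sum| <= 4 * 255 = 1020, so sum >> 3 fits a signed 8-bit value.
     * Right-shifting a negative value is arithmetic in OpenCL C; in standard C
     * it is implementation-defined (arithmetic on common compilers). */
    *dx = (signed char)(sum_dx >> 3);
    *dy = (signed char)(sum_dy >> 3);
}
```

For a 3x3 image whose right column is 80 and whose other pixels are 0, calling sobel_pixel(img, 3, 0, 0, &dx, &dy) accumulates 80 + 160 + 80 = 320 into the horizontal sum, so dx is 40 and dy is 0.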
7.5 Optimizing the General Matrix Multiplication
The Sobel filter implementations have highlighted the importance of using vector
instructions and keeping a high number of work-items active. We next study
implementations of the general matrix multiplication (GEMM) to show why using
the caches effectively matters. We first discuss the relevant aspects of the
caches and how we optimize for them. At the end, we look at the runtimes
measured on an Arndale development board and compare them with our discussion.
7.5.1 Algorithm
The general matrix multiplication is a function of the Basic Linear Algebra
Subprograms (BLAS) API¹¹ that computes $C = \alpha AB + \beta C$, where $A$,
$B$, and $C$ are matrices of floating-point numbers and $\alpha$ and $\beta$
are scalars.
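As a point of reference for the definition above, a naive C version of the operation might look as follows. This is a minimal sketch under stated assumptions, not the BLAS interface or the chapter's kernel: the name sgemm_ref is hypothetical, and the matrices are assumed to be square n x n, single precision, and stored in row-major order.

```c
#include <assert.h>
#include <stddef.h>

/* Reference sketch (hypothetical name): C = alpha * A * B + beta * C.
 * Assumptions: A, B, C are n x n, single precision, row-major, and
 * C does not alias A or B. */
void sgemm_ref(size_t n, float alpha, const float *A, const float *B,
               float beta, float *C)
{
    assert(A != NULL && B != NULL && C != NULL);

    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            /* Dot product of row i of A with column j of B. */
            float acc = 0.0f;
            for (size_t k = 0; k < n; ++k)
                acc += A[i * n + k] * B[k * n + j];

            /* Scale the product and blend in the previous value of C. */
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

Note that the inner loop walks B with a stride of n elements. That access pattern is one reason cache behavior matters for this operation.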
7.5.2 Implementation Details
In our implementation, the matrices are $N \times N$ arrays of single-precision
floating-point numbers (SGEMM). We consider two common SGEMM variants: