// Compute contribution from third row.
load = vload16(0, in + (offset + width*2 + 0));

lData = convert_short8(load.s01234567);
mData = convert_short8(load.s12345678);
rData = convert_short8(load.s23456789);

_dx1 += rData - lData;
_dy1 -= rData + lData + mData * (short8)2;
_dx2 += (rData - lData) * (short8)2;

// ...

// Store the results.
vstore8(convert_char8(_dx1 >> 3), 0, dx1 + offset + width + 1);
vstore8(convert_char8(_dy1 >> 3), 0, dy1 + offset + width + 1);
vstore8(convert_char8(_dx2 >> 3), 0, dx2 + offset + width*2 + 1);
vstore8(convert_char8(_dy2 >> 3), 0, dy2 + offset + width*2 + 1);

Listing 7.7. Computing contribution from the third row: 2xchar8_load16.

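The third-row step of Listing 7.7 can be illustrated with a scalar C sketch. It is not the kernel itself: the function name third_row_contribution, the use of unsigned 8-bit input pixels, and the plain arrays standing in for the short8 accumulators are all assumptions made for illustration. The point it shows is that a single 16-byte load supplies three overlapping 8-wide windows (left, middle, right), which are then folded into the accumulators of both output rows.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar stand-in for the third-row step (hypothetical helper, not from the
   original listing). row2 points at the bytes obtained by the single
   16-wide load; only bytes 0..9 are actually consumed. */
void third_row_contribution(const uint8_t *row2,
                            int16_t dx1[8], int16_t dy1[8], int16_t dx2[8])
{
    assert(row2 != 0);
    for (int i = 0; i < 8; ++i) {
        int l = row2[i];      /* corresponds to load.s01234567 */
        int m = row2[i + 1];  /* corresponds to load.s12345678 */
        int r = row2[i + 2];  /* corresponds to load.s23456789 */

        /* Bottom row of the first output row's 3x3 window. */
        dx1[i] = (int16_t)(dx1[i] + (r - l));
        dy1[i] = (int16_t)(dy1[i] - (r + l + 2 * m));

        /* Middle row of the second output row's window: it contributes
           twice the horizontal difference to dx2 and nothing to dy2. */
        dx2[i] = (int16_t)(dx2[i] + 2 * (r - l));
    }
}
```

In the vector kernel the loop disappears: each of the three statements operates on all eight lanes at once, and the three windows are obtained by swizzling the same loaded register rather than by issuing three separate loads.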
Computing two rows of output. The 2xchar8 and 2xchar8_load16 kernels load from
four input rows to compute results for two output rows (n = 2). They are
partially shown in Listing 7.6 and Listing 7.7 and are modifications of the
char8 and char8_load16 kernels, respectively. Both kernels can be launched with
up to 128 simultaneous work-items per core, and both perform better than their
single-row variants. As before, the load16 version is faster, and indeed
achieves the second-best performance in this study.

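The row reuse described above can be sketched in scalar C for a single output column. This is an illustrative reference under stated assumptions, not the authors' kernel: the function name sobel_two_rows, the single dx and dy output buffers, the unsigned 8-bit input type, and the sign convention for dy (top row positive, bottom row negative, chosen to agree with the third-row step in Listing 7.7) are all assumptions. Each of the four input rows is read once, and the two middle rows contribute to both output rows.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar reference: computes Sobel dx and dy for two vertically
   adjacent output pixels from four input rows, reading each row only once.
   The right shift by 3 scales results into the signed 8-bit range; it
   assumes an arithmetic shift for negative values, as in OpenCL C and on
   mainstream C compilers. */
void sobel_two_rows(const uint8_t *in, int width, int offset,
                    int8_t *dx, int8_t *dy)
{
    assert(width >= 3);
    int dx1 = 0, dy1 = 0, dx2 = 0, dy2 = 0;

    for (int row = 0; row < 4; ++row) {
        const uint8_t *p = in + offset + width * row;
        int l = p[0], m = p[1], r = p[2];
        int h = r - l;          /* horizontal difference */
        int s = l + 2 * m + r;  /* horizontally smoothed sum */

        if (row <= 2) {         /* rows 0..2 feed the first output row */
            dx1 += h * (row == 1 ? 2 : 1);
            dy1 += s * (1 - row);
        }
        if (row >= 1) {         /* rows 1..3 feed the second output row */
            dx2 += h * (row == 2 ? 2 : 1);
            dy2 += s * (2 - row);
        }
    }

    dx[offset + width + 1]     = (int8_t)(dx1 >> 3);
    dy[offset + width + 1]     = (int8_t)(dy1 >> 3);
    dx[offset + width * 2 + 1] = (int8_t)(dx2 >> 3);
    dy[offset + width * 2 + 1] = (int8_t)(dy2 >> 3);
}
```

Computing the two output rows independently would read six rows in total; sharing the two middle rows reduces this to four, which is the trade of load-store work for register-resident arithmetic that the 2xchar8 kernels exploit.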
Computing three rows of output. The 3xchar8 kernel, partially shown in
Listing 7.8, has grown too complex and can only be launched with up to 64
simultaneous work-items per core. Therefore, exploiting data reuse by keeping
more than two rows in registers is suboptimal on the Mali-T604.

7.4.8 Summary
We have presented several versions of the Sobel filter, and discussed their
performance characteristics on the Mali-T604 GPU. Vectorizing kernel code and
exploiting data reuse are the two principal optimization techniques explored in
this study. The fastest kernel, char16_swizzle, is nearly nine times faster than
the slowest kernel, char, which reiterates the importance of target-specific
optimizations for OpenCL code.
To summarize, we note that although the theoretical peak performance is reached
at a ratio of two arithmetic instruction words for every load-store instruction
word, the best performance was obtained in the versions with the highest number
of arithmetic words executed per cycle. Restructuring the program to trade
load-store operations for arithmetic operations has thus been successful, as
long as the kernel could still be launched with 128 simultaneous work-items per
core.