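| Kernel | Pixels per WI | max(LWS) | Arithmetic instructions | Memory instructions | Arithmetic instructions per cycle | Memory instructions per cycle | Pixels per cycle |
|---|---|---|---|---|---|---|---|
| char | 1 | 256 | 13 | 7 | 3.9 | 2.1 | 0.3 |
| char8 | 8 | 128 | 17 | 9 | 3.8 | 2 | 1.8 |
| char16 | 16 | 64 | 21 | 7 | 2.3 | 0.8 | 1.8 |
| char8_load16 | 8 | 128 | 18 | 5 | 5 | 1.4 | 2.2 |
| char16_swizzle | 16 | 128 | 32 | 8 | 5.1 | 1.3 | 2.6 |
| 2xchar8 | 16 | 128 | 29 | 14 | 3.8 | 1.9 | 2.1 |
| 2xchar8_load16 | 16 | 128 | 32 | 7 | 4.9 | 1.1 | 2.5 |
| 3xchar8 | 24 | 64 | 39 | 19 | 2.2 | 1.1 | 1.4 |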
Table 7.1. Performance characteristics of all Sobel versions running on a 512 × 512 image (see Section 7.4.3 for a description of the columns). Note that our implementations compute two output pixels for each input pixel (i.e., double the output of a standard 3 × 3 convolution filter), but we only count the number of input pixels per work-item (WI).
7.4.5 Using Vectors
Eight components. Each work-item of the char8 kernel in Listing 7.2 performs eight char8 load operations to read a 3 × 10 region of input pixels and computes two char8 vectors of output pixels. The conversion and compute instructions operate on short8 data in 128-bit registers.
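The access pattern described above can be sketched on the host. The following is a plain C emulation, not OpenCL C and not a reproduction of Listing 7.2: the names sobel_char8_workitem and load8_widen are hypothetical, the standard Sobel coefficients are assumed, and the division by 8 that fits the results into signed 8-bit values is an assumption about scaling. It also assumes that the eight loads are the nine possible row/offset combinations minus the center one, whose weight is zero in both Sobel masks.

```c
#include <stddef.h>
#include <stdint.h>

#define VEC 8 /* pixels per work-item, mirroring the width of a char8 vector */

/* Emulates one char8 load followed by a conversion to short8:
   copies 8 bytes and widens each to a 16-bit lane. */
void load8_widen(const uint8_t *src, int16_t dst[VEC]) {
    for (int i = 0; i < VEC; ++i) {
        dst[i] = (int16_t)src[i];
    }
}

/* Emulates one work-item. `in` points to the top-left pixel of a 3 x 10
   region and `stride` is the row pitch in bytes. Writes 8 horizontal (dx)
   and 8 vertical (dy) gradient values. */
void sobel_char8_workitem(const uint8_t *in, size_t stride,
                          int8_t dx[VEC], int8_t dy[VEC]) {
    int16_t r0l[VEC], r0m[VEC], r0r[VEC]; /* row 0 at offsets 0, 1, 2 */
    int16_t r1l[VEC], r1r[VEC];           /* row 1 at offsets 0 and 2 */
    int16_t r2l[VEC], r2m[VEC], r2r[VEC]; /* row 2 at offsets 0, 1, 2 */

    /* Eight 8-wide loads; the center load (row 1, offset 1) is skipped. */
    load8_widen(in + 0 * stride + 0, r0l);
    load8_widen(in + 0 * stride + 1, r0m);
    load8_widen(in + 0 * stride + 2, r0r);
    load8_widen(in + 1 * stride + 0, r1l);
    load8_widen(in + 1 * stride + 2, r1r);
    load8_widen(in + 2 * stride + 0, r2l);
    load8_widen(in + 2 * stride + 1, r2m);
    load8_widen(in + 2 * stride + 2, r2r);

    /* 16-bit lane arithmetic; the largest magnitude is 4 * 255 = 1020. */
    for (int i = 0; i < VEC; ++i) {
        int16_t gx = (int16_t)((r0r[i] - r0l[i]) + 2 * (r1r[i] - r1l[i])
                               + (r2r[i] - r2l[i]));
        int16_t gy = (int16_t)((r2l[i] + 2 * r2m[i] + r2r[i])
                               - (r0l[i] + 2 * r0m[i] + r0r[i]));
        /* Assumed scaling: dividing by 8 keeps results within -127..127. */
        dx[i] = (int8_t)(gx / 8);
        dy[i] = (int8_t)(gy / 8);
    }
}
```

For a 3 × 10 input whose value grows by 8 per column, every dx lane evaluates to 8 and every dy lane to 0, which is a quick way to check the indexing.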
Table 7.1 shows that while the char8 kernel computes eight times more data, it performs only about 30% more arithmetic and memory instructions than the char kernel, resulting in a sixfold increase in performance (1.8 pixels per cycle). The ratio between the arithmetic and memory instructions is close to 2:1, which is optimal for cases with few cache misses. Due to the increase in complexity, the kernel max(LWS) is 128, limiting the number of simultaneously active work-items per core to 128. Still, the number of instruction words executed per cycle is nearly the same as for the scalar version, which shows that we can accept a max(LWS) of 128 without significant performance problems for this kind of workload.
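The figures quoted in this paragraph can be checked against the char and char8 rows of Table 7.1. The arithmetic below is a reader's check, assuming the third and fourth numeric columns are the arithmetic and memory instruction counts and the last column is pixels per cycle:

```latex
\begin{align*}
\text{arithmetic instructions:}\quad & \frac{17}{13} \approx 1.31 && \text{(about 30\% more)} \\
\text{memory instructions:}\quad & \frac{9}{7} \approx 1.29 && \text{(about 30\% more)} \\
\text{arithmetic : memory (char8):}\quad & \frac{17}{9} \approx 1.89 && \text{(close to 2:1)} \\
\text{speedup:}\quad & \frac{1.8}{0.3} = 6 && \text{(pixels per cycle)}
\end{align*}
```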
Sixteen components. The number of operations per pixel is reduced even further by using char16 memory operations (full vector register width) and short16 arithmetic operations (broken into short8 operations by the compiler) in the char16 kernel partially shown in Listing 7.3. Performance, however, does not increase, because this kernel can only be executed with up to 64 simultaneous work-items per core (max(LWS) = 64), which introduces bubbles into the pipelines due to reduced latency-hiding capability.
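To see this trade-off in numbers, the sketch below computes operations per pixel and a throughput estimate for the char8 and char16 rows of Table 7.1. The struct, the function names, and the estimate itself (pixels per work-item times arithmetic instructions per cycle, divided by arithmetic instructions per work-item) are assumptions made for illustration. They are not definitions from the text, although the estimate reproduces the table's pixels-per-cycle column to within rounding.

```c
#include <stdio.h>

/* One row of Table 7.1; the field names are hypothetical labels. */
typedef struct {
    const char *name;
    int pixels_per_wi;      /* input pixels per work-item */
    int arith_instr;        /* arithmetic instructions per work-item */
    int mem_instr;          /* memory instructions per work-item */
    double arith_per_cycle; /* arithmetic instructions issued per cycle */
} kernel_row;

/* Values copied from the char8 and char16 rows of Table 7.1. */
const kernel_row CHAR8 = {"char8", 8, 17, 9, 3.8};
const kernel_row CHAR16 = {"char16", 16, 21, 7, 2.3};

/* Arithmetic instructions spent per input pixel. */
double arith_per_pixel(const kernel_row *k) {
    return (double)k->arith_instr / k->pixels_per_wi;
}

/* Memory instructions spent per input pixel. */
double mem_per_pixel(const kernel_row *k) {
    return (double)k->mem_instr / k->pixels_per_wi;
}

/* Assumed estimate: pixels produced per arithmetic instruction,
   multiplied by the achieved issue rate. */
double est_pixels_per_cycle(const kernel_row *k) {
    return k->pixels_per_wi * k->arith_per_cycle / k->arith_instr;
}

/* Prints the derived metrics for one row. */
void print_row(const kernel_row *k) {
    printf("%-8s arith/pixel=%.2f mem/pixel=%.2f est. pixels/cycle=%.2f\n",
           k->name, arith_per_pixel(k), mem_per_pixel(k),
           est_pixels_per_cycle(k));
}
```

Calling print_row on both rows shows the per-pixel arithmetic cost dropping from about 2.1 to about 1.3 instructions, while the estimated throughput stays near 1.8 pixels per cycle because the issue rate falls from 3.8 to 2.3.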