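Version          Pixels   max(LWS)   Instruction words    Instruction words    Pixels
                 per WI              per WI               per cycle            per cycle
                                     Arith.  Load/store   Arith.  Load/store
char             1        256        13      7            3.9     2.1          0.3
char8            8        128        17      9            3.8     2            1.8
char16           16       64         21      7            2.3     0.8          1.8
char8_load16     8        128        18      5            5       1.4          2.2
char16_swizzle   16       128        32      8            5.1     1.3          2.6
2xchar8          16       128        29      14           3.8     1.9          2.1
2xchar8_load16   16       128        32      7            4.9     1.1          2.5
3xchar8          24       64         39      19           2.2     1.1          1.4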
Table 7.1. Performance characteristics of all Sobel versions running on a 512 × 512 image
(see Section 7.4.3 for a description of the columns). Note that our implementations
compute two output pixels for each input pixel (i.e., double the output of a standard 3 × 3
convolution filter), but we only count the number of input pixels per work-item (WI).
7.4.5 Using Vectors
Eight components. Each work-item of the char8 kernel in Listing 7.2 performs
eight char8 load operations to read a 3 × 10 region of input pixels and computes
two char8 vectors of output pixels. The conversion and compute instructions
operate on short8 data in 128-bit registers.
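The access pattern described above can be emulated on the host to check the
arithmetic. The following Python sketch is not Listing 7.2. It assumes that the two
output vectors are the horizontal and vertical Sobel gradients, that the center pixel
of each 3 × 3 window is never loaded (its Sobel weight is zero in both directions),
and it omits any scaling or packing of the results back to char8. The names load8 and
sobel_char8_work_item are hypothetical.

```python
def load8(image, row, col):
    """Emulate one char8 load: eight consecutive pixels from one row."""
    return image[row][col:col + 8]


def sobel_char8_work_item(image, row, col):
    """Return (dx, dy) for eight pixels whose 3x3 windows start at (row, col).

    The eight loads together cover rows row..row+2 and columns col..col+9,
    that is, a 3 x 10 region. Intermediates are plain Python ints here;
    the kernel described in the text uses short8 data for this step.
    """
    # Top and bottom rows: left, middle, and right columns (3 loads each).
    t0, t1, t2 = (load8(image, row, col + k) for k in range(3))
    b0, b1, b2 = (load8(image, row + 2, col + k) for k in range(3))
    # Middle row: left and right columns only (2 loads).
    m0 = load8(image, row + 1, col)
    m2 = load8(image, row + 1, col + 2)

    # Horizontal gradient: right column minus left column, weights 1, 2, 1.
    dx = [(t2[i] - t0[i]) + 2 * (m2[i] - m0[i]) + (b2[i] - b0[i])
          for i in range(8)]
    # Vertical gradient: bottom row minus top row, weights 1, 2, 1.
    dy = [(b0[i] + 2 * b1[i] + b2[i]) - (t0[i] + 2 * t1[i] + t2[i])
          for i in range(8)]
    return dx, dy
```

On a 3 × 10 image whose pixel value equals its column index, every element of dx is 8
and every element of dy is 0, which is a quick sanity check for the load offsets.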
Table 7.1 shows that although the char8 kernel computes eight times more data per
work-item, it executes only about 30% more arithmetic and memory instructions than the
char kernel, resulting in a sixfold increase in performance (from 0.3 to 1.8 pixels per
cycle). The ratio between arithmetic and memory instructions is close to 2:1, which is
optimal for cases with few cache misses. Because the kernel is more complex, its
max(LWS) is 128, limiting the number of simultaneously active work-items per core
to 128. Still, the number of instruction words executed per cycle is nearly the same
as for the scalar version, which shows that we can accept a max(LWS) of 128 without
significant performance problems for this kind of workload.
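The figures quoted in this paragraph can be recomputed from the char and char8 rows of
Table 7.1. The sketch below assumes that the fourth and fifth numbers in each row are
the arithmetic and load/store instruction words per work-item and that the last number
is pixels per cycle; the variable names are ours.

```python
# Values copied from the char and char8 rows of Table 7.1.
char_scalar = {"a_words": 13, "ls_words": 7, "pixels_per_cycle": 0.3}
char8 = {"a_words": 17, "ls_words": 9, "pixels_per_cycle": 1.8}


def growth(new, old):
    """Relative increase of new over old (0.3 means 30% more)."""
    return new / old - 1.0


a_growth = growth(char8["a_words"], char_scalar["a_words"])
ls_growth = growth(char8["ls_words"], char_scalar["ls_words"])
speedup = char8["pixels_per_cycle"] / char_scalar["pixels_per_cycle"]
a_to_ls = char8["a_words"] / char8["ls_words"]

print(f"arithmetic words per work-item: +{a_growth:.0%}")
print(f"load/store words per work-item: +{ls_growth:.0%}")
print(f"speedup in pixels per cycle:    {speedup:.1f}x")
print(f"arithmetic : load/store ratio:  {a_to_ls:.2f} : 1")
```

Under these assumptions the instruction counts grow by roughly 31% and 29%, the
throughput ratio is 6, and the arithmetic to load/store ratio is about 1.9:1.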
Sixteen components. The number of operations per pixel is reduced even further
by using char16 memory operations (full vector register width) and short16 arithmetic
operations (broken into short8 operations by the compiler) in the char16 kernel,
partially shown in Listing 7.3. Performance, however, does not increase, because this
kernel can only be executed with up to 64 simultaneous work-items per core
(max(LWS) = 64), which reduces the latency-hiding capability and introduces bubbles
into the pipelines.
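To make the trade-off concrete, the sketch below divides the instruction words per
work-item by the pixels per work-item for the three versions discussed so far. The
numbers are taken from Table 7.1 under the assumption that the columns are, in order,
pixels per work-item, max(LWS), arithmetic words, load/store words, and (last) pixels
per cycle; the helper words_per_pixel is hypothetical.

```python
# (pixels per work-item, max(LWS), arithmetic words, load/store words,
#  pixels per cycle), copied from Table 7.1.
rows = {
    "char":   (1, 256, 13, 7, 0.3),
    "char8":  (8, 128, 17, 9, 1.8),
    "char16": (16, 64, 21, 7, 1.8),
}


def words_per_pixel(pixels_per_wi, a_words, ls_words):
    """Instruction words (arithmetic plus load/store) spent per input pixel."""
    return (a_words + ls_words) / pixels_per_wi


summary = {
    name: (words_per_pixel(pixels, a, ls), lws, ppc)
    for name, (pixels, lws, a, ls, ppc) in rows.items()
}

for name, (wpp, lws, ppc) in summary.items():
    print(f"{name:7s} words/pixel={wpp:5.2f}  max(LWS)={lws:3d}  pixels/cycle={ppc}")
```

The char16 version needs 1.75 instruction words per pixel against 3.25 for char8, yet
both reach 1.8 pixels per cycle, which is consistent with the halved max(LWS) cancelling
the gain.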