Optimizing OpenCL Kernels for the ARM Mali-T600 GPUs - GPU Pro: Advanced Rendering Techniques - page 334

Name            Pixels/WI  max(LWS)  Arith. words/WI  Mem. words/WI  Arith. words/cycle  Mem. words/cycle  Pixels/cycle
char                    1       256               13              7                 3.9               2.1           0.3
char8                   8       128               17              9                 3.8               2             1.8
char16                 16        64               21              7                 2.3               0.8           1.8
char8_load16            8       128               18              5                 5                 1.4           2.2
char16_swizzle         16       128               32              8                 5.1               1.3           2.6
2xchar8                16       128               29             14                 3.8               1.9           2.1
2xchar8_load16         16       128               32              7                 4.9               1.1           2.5
3xchar8                24        64               39             19                 2.2               1.1           1.4

Table 7.1. Performance characteristics of all Sobel versions running on a 512 × 512 image (see Section 7.4.3 for a description of the columns). Note that our implementations compute two output pixels for each input pixel (i.e., double the output of a standard 3 × 3 convolution filter), but we only count the number of input pixels per work-item (WI).

7.4.5 Using Vectors

Eight components. Each work-item of the char8 kernel in Listing 7.2 performs eight char8 load operations to read a 3 × 10 region of input pixels and computes two char8 vectors of output pixels. The conversion and compute instructions operate on short8 data in 128-bit registers.
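The access pattern described above can be emulated on the host to see why eight loads are enough for a 3 × 10 region. The following is a minimal sketch in plain Python, not a reproduction of Listing 7.2: the function names `load8` and `sobel_char8_work_item` are hypothetical, the standard 3 × 3 Sobel weights are assumed, and the sign convention, output scaling, and conversion back to char8 used by the actual kernel are assumptions left out here. Python integers stand in for the short8 lanes; the largest possible magnitude is 4 × 255 = 1020, which fits in 16 bits.

```python
def load8(image, row, col):
    """Emulate one char8 load: eight consecutive pixels of a single row."""
    return image[row][col:col + 8]


def sobel_char8_work_item(image, row, col):
    """Emulate one work-item; (row, col) is the top-left of the 3 x 10 region.

    Returns two lists of eight values (dx, dy), one pair per output pixel.
    """
    # Eight loads in total: three for the top row, two for the middle row,
    # three for the bottom row. The middle row needs no centre load because
    # the centre pixel has zero weight in both Sobel filters.
    tl, tm, tr = (load8(image, row, col + k) for k in range(3))
    ml, mr = load8(image, row + 1, col), load8(image, row + 1, col + 2)
    bl, bm, br = (load8(image, row + 2, col + k) for k in range(3))

    # Horizontal gradient: right column minus left column, weights 1, 2, 1.
    dx = [(r0 - l0) + 2 * (r1 - l1) + (r2 - l2)
          for l0, r0, l1, r1, l2, r2 in zip(tl, tr, ml, mr, bl, br)]
    # Vertical gradient: bottom row minus top row, weights 1, 2, 1.
    dy = [(b0 - t0) + 2 * (b1 - t1) + (b2 - t2)
          for t0, b0, t1, b1, t2, b2 in zip(tl, bl, tm, bm, tr, br)]
    return dx, dy


# A horizontal ramp (pixel value = 10 * column) has a constant dx and no dy.
ramp = [[10 * c for c in range(10)] for _ in range(3)]
dx, dy = sobel_char8_work_item(ramp, 0, 0)
print(dx)  # [80, 80, 80, 80, 80, 80, 80, 80]
print(dy)  # [0, 0, 0, 0, 0, 0, 0, 0]
```

Each of the three overlapping loads per row is shifted by one pixel, so the eight lanes of a load line up with the left, centre, and right neighbours of the eight output pixels without any per-lane shuffling.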

Table 7.1 shows that although the char8 kernel computes eight times as much data, it executes only about 30% more arithmetic and memory instructions than the char kernel, resulting in a sixfold increase in performance (1.8 pixels per cycle). The ratio between arithmetic and memory instructions is close to 2:1, which is optimal for cases with few cache misses. Because the kernel is more complex, its max(LWS) is 128, which limits the number of simultaneously active work-items per core to 128. Still, the number of instruction words executed per cycle is nearly the same as for the scalar version, which shows that a max(LWS) of 128 is acceptable for this kind of workload without significant performance problems.
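The ratios quoted in this paragraph can be checked directly against the char and char8 rows of Table 7.1. The short Python sketch below assumes that the third and fourth numeric columns of the table are the arithmetic and memory instruction words per work-item, and that the last column is pixels per cycle; the dictionary keys and the `growth` helper are illustrative names only.

```python
# Figures copied from the char and char8 rows of Table 7.1.
# Column meanings are an assumption (see the text above this block).
char = {"pixels": 1, "arith": 13, "mem": 7, "pixels_per_cycle": 0.3}
char8 = {"pixels": 8, "arith": 17, "mem": 9, "pixels_per_cycle": 1.8}


def growth(new, old):
    """Relative increase of new over old, as a fraction."""
    return new / old - 1.0


arith_growth = growth(char8["arith"], char["arith"])  # 17 vs. 13 words
mem_growth = growth(char8["mem"], char["mem"])        # 9 vs. 7 words
arith_to_mem = char8["arith"] / char8["mem"]          # compare with 2:1
speedup = char8["pixels_per_cycle"] / char["pixels_per_cycle"]

print(f"arithmetic words: +{arith_growth:.0%}")   # +31%
print(f"memory words:     +{mem_growth:.0%}")     # +29%
print(f"arith:mem ratio:  {arith_to_mem:.2f}:1")  # 1.89:1
print(f"speedup:          {speedup:.1f}x")        # 6.0x
```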

Sixteen components. The number of operations per pixel is reduced even further by using char16 memory operations (full vector register width) and short16 arithmetic operations (broken into short8 operations by the compiler) in the char16 kernel partially shown in Listing 7.3. Performance, however, does not increase, because this kernel can only be executed with up to 64 simultaneous work-items per core (max(LWS) = 64), which introduces bubbles into the pipelines due to reduced latency-hiding capability.
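The trade-off between fewer operations per pixel and a lower issue rate can be made concrete with the char8 and char16 rows of Table 7.1. The Python sketch below rests on two assumptions that are not stated in this section: that the table lists arithmetic instruction words per work-item and arithmetic instruction words executed per cycle, and that throughput can be estimated as work-items completed per cycle times pixels per work-item. The function names are hypothetical.

```python
def arith_words_per_pixel(pixels_per_wi, arith_words_per_wi):
    """Arithmetic instruction words spent on each input pixel."""
    return arith_words_per_wi / pixels_per_wi


def pixels_per_cycle(pixels_per_wi, arith_words_per_wi, arith_words_per_cycle):
    """Estimated throughput: work-items per cycle times pixels per work-item."""
    work_items_per_cycle = arith_words_per_cycle / arith_words_per_wi
    return pixels_per_wi * work_items_per_cycle


# (pixels/WI, arithmetic words/WI, arithmetic words/cycle) from Table 7.1.
char8 = (8, 17, 3.8)
char16 = (16, 21, 2.3)

# char16 needs fewer arithmetic words per pixel ...
print(arith_words_per_pixel(*char8[:2]))   # 2.125
print(arith_words_per_pixel(*char16[:2]))  # 1.3125

# ... but issues fewer words per cycle, so estimated throughput stays level.
print(f"{pixels_per_cycle(*char8):.2f}")   # 1.79
print(f"{pixels_per_cycle(*char16):.2f}")  # 1.75
```

Both estimates round to the 1.8 pixels per cycle reported for the two kernels, which is consistent with the explanation that the drop to max(LWS) = 64 costs about as much in pipeline utilization as the wider vectors save in instruction count.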
