Optimizing OpenCL Kernels for the ARM Mali-T600 GPUs - GPU Pro: Advanced Rendering Techniques - page 355

Size     sNN    sNT    bNN    bNT   cbNN   cbNT

  96     2.0    2.0    8.0   11.2    5.1    7.4

 192     2.0    2.1    8.4   13.0    5.1    8.1

 384     2.0    2.1    7.6   13.2    6.4   10.1

 768     2.0    2.1    7.5   13.1    6.6   10.0

1440     2.0    2.1    7.4   13.1    6.5    9.3

2880     1.6    2.0    6.8   10.8    6.1    9.2

Table 7.3. Performance in GFLOPS for some variants and matrix sizes. For each

matrix size and kind of implementation, we have only selected one number, and it may,

e.g., use different blocking parameters at different sizes. The columns show the best

performance we see in GFLOPS for scalar (s), blocked (b), and cache-blocked (cb)

variants of non-transposed (NN) and transposed (NT) SGEMM implementations.
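
As a reminder of what a GFLOPS figure for SGEMM measures, the following is a minimal sketch of the usual conversion from run time to GFLOPS. It assumes the conventional count of 2·m·n·k floating-point operations per multiply; the function name and the timing in the example are hypothetical and are not taken from the measurements in the table.

```python
def sgemm_gflops(m, n, k, seconds):
    """Return GFLOPS for one (m x k) by (k x n) single-precision multiply."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    # Each of the m*n outputs needs k multiplies and k adds: 2*m*n*k operations.
    flops = 2.0 * m * n * k
    return flops / seconds / 1e9


# Hypothetical timing, for illustration only: a 768 x 768 x 768 multiply in 0.1 s.
example = sgemm_gflops(768, 768, 768, 0.1)
```

With these made-up inputs the result is roughly 9 GFLOPS, which shows that halving the run time at a fixed size doubles the reported number.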

Overall trends. In Table 7.3, we show performance numbers for different kinds

of implementations and different matrix sizes. Variation within one kind is not

shown, and different rows in the same column can contain results for different

variants of the same kind (e.g., different blocking parameters or work-group sizes).

We see that the NT versions perform better than NN versions and that blocked

versions are better than scalar versions, as expected. While already the smallest

shown matrix size appears sufficient for the scalar versions, the blocked and (more

strongly) the cache-blocked variants need larger amounts of work to reach their

best performance.
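
The difference between the NN and NT variants can be sketched with a plain reference implementation. This is not the kernel code behind the table, only an illustration of the two access patterns; the function names are hypothetical, and Python lists merely mimic a row-major layout. A likely reason the NT form does better is that both operands are read along rows, which suits contiguous (vector) loads.

```python
def sgemm_nn(a, b):
    """C = A * B, with B stored non-transposed (k rows of n values)."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                # b[p][j] walks down a column of B (strided in row-major storage).
                acc += a[i][p] * b[p][j]
            c[i][j] = acc
    return c


def sgemm_nt(a, bt):
    """C = A * B, where bt holds B transposed (n rows of k values)."""
    m, k, n = len(a), len(a[0]), len(bt)
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                # a[i][p] and bt[j][p] both walk along a row (contiguous).
                acc += a[i][p] * bt[j][p]
            c[i][j] = acc
    return c


def transpose(mat):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*mat)]
```

Both functions compute the same product; calling `sgemm_nt(a, transpose(b))` gives the same result as `sgemm_nn(a, b)`.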

Our experiments with larger matrices, where performance is heavily influ-

enced by system effects like thread-divergence, are not shown in the table, due

to variations in results.

Cache-blocking. One surprising result in the table is the large difference between

the blocked and cache-blocked variants, where the introduction of cache-blocking

seems to come at a large cost. This impression is misleading and due to our

selection of data. We only display the best performance achieved for each kind

of implementation, and we saw in the Sobel results that the number of registers

plays a crucial role in determining performance. For the best blocked NT imple-

mentation we have, the corresponding cache-blocked versions use more registers,

which prevents us from keeping 128 simultaneous threads per core. Due to the

limited number of threads, this version is not our best cache-blocked implemen-

tation, as we have other variants that use fewer registers. The columns bNT and

cbNT therefore display results for different implementations. If we instead

consider (nearly) identical implementations of 1 × 4 × 4 blocked and cache-blocked NT

variants (shown in Table 7.4), we see much larger similarities in the top results,

and we also see the large difference in the median result for large matrices.
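
The link between register use and the number of simultaneous threads can be illustrated with a simple budget model. This is a sketch under stated assumptions, not a description of the hardware: the register-file size (1024), the hardware thread limit (256), and the power-of-two rounding are illustrative values chosen so that a kernel using 8 registers per thread fits 128 threads, and they should be checked against vendor documentation. The function name is hypothetical.

```python
def max_threads_per_core(registers_per_thread, register_file=1024, hw_limit=256):
    """Estimate simultaneous threads per core under a register-budget model.

    All default values are assumptions for illustration, not hardware facts.
    """
    if registers_per_thread < 1:
        raise ValueError("registers_per_thread must be positive")
    # Threads that fit in the register file, capped by the hardware limit.
    fit = min(hw_limit, register_file // registers_per_thread)
    if fit < 1:
        return 0
    # Round down to a power of two (assumed allocation granularity).
    threads = 1
    while threads * 2 <= fit:
        threads *= 2
    return threads
```

Under these assumptions, going from 8 to 9 registers per thread halves the estimate from 128 to 64 threads, which is the kind of step the paragraph above describes for the cache-blocked variant.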

Work-group divergence. For large workloads, work-group divergence will start to

appear, and this will affect performance. The occasional very good performance
