Optimizing OpenCL Kernels for the ARM Mali-T600 GPUs - GPU Pro: Advanced Rendering Techniques - page 356

Size    Blocked            Cache-Blocked
        Median    High     Median    High
  96      9.3      9.4       7.0      7.4
 192     10.5     10.6       8.0      8.1
 384      9.9     10.0       9.9     10.0
 768      9.7      9.9       9.7     10.0
1440      9.6      9.9       9.1      9.3
2880      1.1      9.4       9.2      9.2

Table 7.4. Performance in GFLOPS for 1 × 4 × 4 blocked and cache-blocked NT implementations. Each implementation was run three times for each matrix size and each local work-group size. The median and best results from the nine runs are listed in the table.
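The reduction described in the caption, from nine timed runs to a median and a best value, can be sketched as follows. This is a minimal illustration, not the authors' measurement harness: the function names `gflops` and `median_and_high` are hypothetical, and the flop count of 2N³ for an N × N × N multiplication is the conventional assumption rather than something stated in the caption.

```c
#include <assert.h>
#include <stdlib.h>

/* Assumed flop count for an N x N x N matrix multiplication:
   N^3 multiplications plus N^3 additions. */
double gflops(unsigned n, double seconds)
{
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}

/* Three-way comparison of two doubles for qsort. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts the samples in place, then reports the middle value and the
   largest value. For an even count the median is the mean of the two
   middle samples. */
void median_and_high(double *samples, size_t count,
                     double *median, double *high)
{
    assert(count > 0);
    qsort(samples, count, sizeof samples[0], cmp_double);
    *median = (count % 2)
                  ? samples[count / 2]
                  : 0.5 * (samples[count / 2 - 1] + samples[count / 2]);
    *high = samples[count - 1];
}
```

Calling `median_and_high` on an array of nine GFLOPS samples fills in the two values reported per table cell pair. When most runs are slow and only a few are fast, the median lands among the slow runs while the high value reflects the fast ones, which is the pattern a large gap between the two columns indicates.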

results with blocking versions probably appear in runs where we see no or very little work-group divergence, whereas the much lower median result shows that the typical case has lower performance.

Cache-blocking, introduced to stifle thread divergence, also has an effect in preventing work-group divergence, thereby decreasing the variability.

The cost of a barrier. In our previous estimate of the cost of executing a barrier instruction, we assumed that we would have one work-item per cycle entering into the barrier, and then one work-item per cycle exiting. In reality, with thread divergence, we do not have the ideal case of one work-item entering the barrier every cycle, as the first work-item will be a few instructions ahead of the last one. Instead, all work-items will have to wait for the last one to arrive at the barrier. The actual cost of the barrier can therefore be significantly higher than our estimate, if a few work-items of the work-group are far behind in executing the program.
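The effect of a straggler can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not a description of the Mali-T600 hardware: it takes the cycle at which each work-item reaches the barrier, waits for the latest one, and then lets work-items leave one per cycle. The function name `barrier_cycles` and the cost definition (cycles from the first arrival until the last work-item has left) are hypothetical choices made for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of barrier cost: cycles from the first arrival at the
   barrier until the last work-item has left it. Nobody leaves before
   the slowest work-item has arrived; afterwards one leaves per cycle. */
unsigned long barrier_cycles(const unsigned long *arrival, size_t n)
{
    assert(n > 0);
    unsigned long first = arrival[0];
    unsigned long last = arrival[0];
    for (size_t i = 1; i < n; ++i) {
        if (arrival[i] < first) first = arrival[i];
        if (arrival[i] > last)  last = arrival[i];
    }
    unsigned long entering = last - first + 1; /* wait for the slowest one */
    unsigned long exiting  = (unsigned long)n; /* one work-item per cycle  */
    return entering + exiting;
}
```

Under this model, N work-items arriving on consecutive cycles cost 2N cycles, which corresponds to the idealized assumption of one work-item entering and one exiting per cycle. If a single work-item arrives d cycles later than it would in the ideal case, the cost grows to 2N + d, so the lag of the slowest work-item is paid by the whole work-group.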

7.5.11 Summary

We started from scalar versions of a generic matrix multiplication and then transformed the initial kernels, successively arriving at more elaborate implementations. We discussed the reasons behind the transformations and how they took advantage of aspects of the hardware. We started by introducing vector operations, and then we focused on the memory system and in particular on the L1 cache. We also discussed execution times, as measured on an Arndale development board, and we found that the qualitative results were as we expected, although a quantitative comparison shows that our simplifying assumptions are not always satisfied.
