Size      Blocked             Cache-Blocked
          Median    High      Median    High
  96       9.3       9.4       7.0       7.4
 192      10.5      10.6       8.0       8.1
 384       9.9      10.0       9.9      10.0
 768       9.7       9.9       9.7      10.0
1440       9.6       9.9       9.1       9.3
2880       1.1       9.4       9.2       9.2
Table 7.4. Performance in GFLOPS for 1 × 4 × 4 blocked and cache-blocked NT
implementations. Each implementation was run three times for each matrix size and each
local work-group size. The median and best results from the nine runs are listed in the
table.
results with blocking versions probably appear in runs where we see no or very
little work-group divergence, whereas the much lower median result shows that
the typical case has lower performance.
Cache-blocking, introduced to stifle thread divergence, also has an effect in
preventing work-group divergence, thereby decreasing the variability.
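The difference between the median and high columns of Table 7.4 can be illustrated with a minimal sketch of how the two figures are taken from nine runs. The GFLOPS values below are hypothetical, chosen only to show how a few divergence-free runs can produce a high best result while the median stays low; they are not measurements.

```python
from statistics import median

# Hypothetical GFLOPS figures for nine runs of one matrix size
# (three runs for each of three local work-group sizes).
# Illustrative values only, not measured data.
runs = [9.4, 1.0, 1.1, 9.2, 1.2, 1.0, 9.3, 1.1, 9.1]

def summarize(gflops):
    """Return (median, best) of a list of GFLOPS results."""
    return median(gflops), max(gflops)

med, high = summarize(runs)
print(f"median = {med} GFLOPS, high = {high} GFLOPS")
```

With an odd number of runs the median is simply the middle value after sorting, so it reflects the typical run rather than the best one.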
The cost of a barrier. In our previous estimate of the cost of executing a barrier
instruction, we assumed that we would have one work-item per cycle entering
into the barrier, and then one work-item per cycle exiting. In reality, with thread
divergence, we do not have the ideal case of one work-item entering the barrier
every cycle, as the first work-item will be a few instructions ahead of the last
one. Instead, all work-items will have to wait for the last one to arrive at the
barrier. The actual cost of the barrier can therefore be significantly higher than
our estimate, if a few work-items of the work-group are far behind in executing
the program.
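The effect described above can be sketched with a toy model, under the assumption that a barrier releases only when the last work-item arrives and that each work-item's cost is the number of cycles it spends waiting. The work-group size, the number of stragglers, and the 200-cycle delay are hypothetical values picked for illustration, and the exit phase of the barrier is ignored.

```python
def barrier_wait_cycles(arrival_cycles):
    """Per-work-item wait at a barrier, in cycles.

    The barrier opens only once the last work-item has arrived,
    so every work-item waits until the latest arrival time.
    """
    release = max(arrival_cycles)
    return [release - a for a in arrival_cycles]

def total_wait(arrival_cycles):
    """Total cycles spent waiting, summed over all work-items."""
    return sum(barrier_wait_cycles(arrival_cycles))

n = 64  # hypothetical work-group size

# Ideal case: one work-item enters the barrier per cycle.
ideal = list(range(n))

# Hypothetical divergence: the last four work-items are 200 cycles behind.
diverged = ideal[:-4] + [a + 200 for a in ideal[-4:]]

print("release cycle (ideal):   ", max(ideal))
print("release cycle (diverged):", max(diverged))
print("total wait (ideal):      ", total_wait(ideal))
print("total wait (diverged):   ", total_wait(diverged))
```

In this model a handful of late work-items delays the release for the whole work-group, which is why the total waiting time grows much faster than the number of stragglers.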
7.5.11 Summary
We started from scalar versions of a generic matrix multiplication and then
transformed the initial kernels, successively arriving at more elaborate
implementations. We discussed the reasons behind the transformations and
how they took advantage of aspects of the hardware. We started
by introducing vector operations, and then we focused on the memory system and
in particular on the L1 cache. We also discussed execution times, as measured on
an Arndale development board, and we found that the qualitative results were
as we expected, although a quantitative comparison shows that our simplifying
assumptions are not always satisfied.