Size    sNN   sNT   bNN    bNT   cbNN   cbNT
96      2.0   2.0   8.0   11.2    5.1    7.4
192     2.0   2.1   8.4   13.0    5.1    8.1
384     2.0   2.1   7.6   13.2    6.4   10.1
768     2.0   2.1   7.5   13.1    6.6   10.0
1440    2.0   2.1   7.4   13.1    6.5    9.3
2880    1.6   2.0   6.8   10.8    6.1    9.2
Table 7.3. Performance in GFLOPS for some variants and matrix sizes. For each
matrix size and kind of implementation, we report a single number, the best we
measured; the implementation behind it may differ between sizes (e.g., in its
blocking parameters). The columns show the best performance in GFLOPS for
scalar (s), blocked (b), and cache-blocked (cb) variants of non-transposed (NN)
and transposed (NT) SGEMM implementations.
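To relate the GFLOPS figures to run times, the following minimal sketch uses the usual operation count for dense matrix multiplication: an N × N by N × N product takes N^3 multiply-add pairs, i.e., 2N^3 floating-point operations. It assumes square matrices and that counting convention; the function names are ours for illustration, and the run time it prints is only the time implied by a table entry, not a measurement from the text.

```python
def sgemm_flops(n: int) -> int:
    """Floating-point operations for an n x n by n x n multiply (2 per multiply-add)."""
    return 2 * n ** 3


def gflops(n: int, seconds: float) -> float:
    """Achieved GFLOPS for one n x n SGEMM that took `seconds`."""
    return sgemm_flops(n) / seconds / 1e9


def implied_seconds(n: int, rate_gflops: float) -> float:
    """Run time implied by a GFLOPS figure (the inverse of gflops())."""
    return sgemm_flops(n) / (rate_gflops * 1e9)


if __name__ == "__main__":
    # bNT entry for matrix size 768 in Table 7.3.
    t = implied_seconds(768, 13.1)
    print(f"implied time per multiply: {t * 1e3:.1f} ms")
```

Because the operation count grows as N^3, a constant GFLOPS figure down a column means the run time grows roughly eightfold each time the matrix size doubles.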
Overall trends. In Table 7.3, we show performance numbers for different kinds
of implementations and different matrix sizes. Variation within one kind is not
shown, and different rows in the same column can contain results for different
variants of the same kind (e.g., different blocking parameters or work-group sizes).
As expected, the NT versions perform better than the NN versions, and the
blocked versions perform better than the scalar versions. The smallest matrix
size shown already appears sufficient for the scalar versions, whereas the
blocked and (more strongly) the cache-blocked variants need larger amounts of
work to reach their best performance.
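The difference between the NN and NT variants can be illustrated with a scalar sketch in plain Python. It assumes the common BLAS reading of the names, in which NT means the second operand is supplied already transposed; the function names are ours, and this is not the OpenCL code that was measured. In the NN form the inner loop strides through a column of B, while in the NT form both operands are read along rows.

```python
def sgemm_nn(a, b):
    """C = A * B with B stored as given (k x n); the inner loop walks a column of B."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]  # strided access into B
            c[i][j] = acc
    return c


def sgemm_nt(a, bt):
    """C = A * B with B supplied transposed (bt is n x k); both operands read along rows."""
    m, n = len(a), len(bt)
    c = [[0.0] * n for _ in range(m)]
    for i in range(m):
        row_a = a[i]
        for j in range(n):
            row_b = bt[j]  # contiguous row of the transposed operand
            c[i][j] = sum(x * y for x, y in zip(row_a, row_b))
    return c


def transpose(mat):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*mat)]


if __name__ == "__main__":
    a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
    b = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
    print(sgemm_nn(a, b) == sgemm_nt(a, transpose(b)))
```

Both functions compute the same product; only the memory layout of the second operand, and therefore the access pattern in the inner loop, differs.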
Our experiments with larger matrices, where performance is heavily influ-
enced by system effects like thread-divergence, are not shown in the table, due
to variations in results.
Cache-blocking. One surprising result in the table is the large difference between
the blocked and cache-blocked variants, where the introduction of cache-blocking
seems to come at a large cost. This impression is misleading and due to our
selection of data. We only display the best performance achieved for each kind
of implementation, and we saw in the Sobel results that the number of registers
plays a crucial role in determining performance. For the best blocked NT
implementation we have, the corresponding cache-blocked versions use more
registers, which prevents us from keeping 128 simultaneous threads per core.
Due to the limited number of threads, this version is not our best
cache-blocked implementation, as we have other variants that use fewer
registers. The columns bNT and cbNT therefore display results for different
implementations. If we instead consider (nearly) identical implementations of
1 × 4 × 4 blocked and cache-blocked NT variants (shown in Table 7.4), we see
much larger similarities in the top results, and we also see the large
difference in the median result for large matrices.
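The link between register use and the number of simultaneous threads can be sketched with a deliberately simplified model: a core has a fixed register budget, so a kernel that needs more registers per thread leaves room for fewer resident threads. The budget and thread limit below are hypothetical placeholders chosen only to make the arithmetic visible; they are not hardware figures from the text, and real hardware may allocate registers in coarser steps.

```python
def resident_threads(regs_per_thread: int,
                     register_budget: int,
                     max_threads: int) -> int:
    """Threads that fit on one core under a fixed per-core register budget.

    Simplified, hypothetical model for illustration only.
    """
    if regs_per_thread <= 0:
        raise ValueError("regs_per_thread must be positive")
    return min(max_threads, register_budget // regs_per_thread)


if __name__ == "__main__":
    BUDGET = 1024      # hypothetical registers per core, NOT a hardware figure
    MAX_THREADS = 256  # hypothetical per-core thread limit
    for regs in (4, 8, 12, 16):
        print(regs, resident_threads(regs, BUDGET, MAX_THREADS))
```

Under these assumed numbers, a variant needing 8 registers per thread keeps 128 threads resident, while one needing 12 drops below that, which is the kind of effect the paragraph above describes for the cache-blocked version.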
Work-group divergence. For large workloads, work-group divergence will start to
appear, and this will affect performance. The occasional very good performance