This notation allows us to write a single iteration over k as

    C[i, j] += A[i, k].x × B[4k + 0, j] + A[i, k].y × B[4k + 1, j]
             + A[i, k].z × B[4k + 2, j] + A[i, k].w × B[4k + 3, j],

for i = 32n, …, 32n + 31 and j = 4m, …, 4m + 3.
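As a concrete illustration, one such iteration over k can be sketched in plain C (the real kernel would be OpenCL C, where float4 and mad() are built in). The names iterate_k and f4_mad and the dimension macros KDIM4 and NDIM4 are our own illustrative assumptions, not the actual kernel:

```c
#include <assert.h>

#define KDIM4 2  /* K/4: float4 elements per row of A (illustrative size) */
#define NDIM4 4  /* N/4: float4 columns of B and C (illustrative size)    */

typedef struct { float x, y, z, w; } float4;

/* acc += s * b componentwise, as an OpenCL mad() on float4 would do. */
static float4 f4_mad(float s, float4 b, float4 acc) {
    acc.x += s * b.x; acc.y += s * b.y;
    acc.z += s * b.z; acc.w += s * b.w;
    return acc;
}

/* One iteration over k of the update above, for the 32x4 block of float4
 * elements of C owned by work-group (n, m). A[i][k] packs the four scalar
 * k-values 4k+0 .. 4k+3 of row i into one float4. */
void iterate_k(int n, int m, int k,
               float4 A[][KDIM4], float4 B[][NDIM4], float4 C[][NDIM4])
{
    for (int i = 32 * n; i <= 32 * n + 31; ++i) {
        float4 a = A[i][k];
        for (int j = 4 * m; j <= 4 * m + 3; ++j) {
            float4 c = C[i][j];
            c = f4_mad(a.x, B[4 * k + 0][j], c);
            c = f4_mad(a.y, B[4 * k + 1][j], c);
            c = f4_mad(a.z, B[4 * k + 2][j], c);
            c = f4_mad(a.w, B[4 * k + 3][j], c);
            C[i][j] = c;
        }
    }
}
```

Keeping the four scalar-times-float4 multiply-adds explicit makes the four rows of B touched by one iteration visible as four separate memory accesses, which is exactly what the cache-line analysis below counts.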
As a cache line has space for four float4 elements, we see that the reads from
A read the first quarter of 32 consecutive cache lines and the reads from B read
four full cache lines. To get full cache lines instead, we consider four consecutive
iterations in k together, and we see that those four iterations read 32 full cache
lines from A and 16 full cache lines from B . For the moment, we restrict ourselves
to considering a single work-group, and we note that these cache lines will never
be reused by later operations in the same work-group. We have now arrived at
our conclusion for the L1 cache requirements of the loop. If our L1 cache has
enough space for 48 cache lines, then we will never read the same value into the
L1 cache twice while executing the loop for all work-items in a work-group, as all
subsequent uses will be able to reuse the value that is stored in the cache.
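The counting in this paragraph is easy to verify mechanically. The following sketch reproduces the 32 + 16 = 48 figure under the stated assumption of cache lines holding four 16-byte float4 elements (so 64-byte lines); the function names are ours, for illustration only:

```c
#include <assert.h>

/* Cache-line bookkeeping for four consecutive iterations over k,
 * assuming 64-byte cache lines and 16-byte float4 elements
 * (four float4 per line). */
enum { LINE_BYTES = 64, FLOAT4_BYTES = 16 };

/* A: 32 rows, each reading the four consecutive float4 elements
 * k .. k+3 of its row -> 32 full cache lines. */
int a_lines_per_four_iterations(void) {
    return 32 * 4 * FLOAT4_BYTES / LINE_BYTES;
}

/* B: 16 consecutive scalar rows (4k .. 4k+15), each spanning the four
 * float4 columns 4m .. 4m+3 -> 16 full cache lines. */
int b_lines_per_four_iterations(void) {
    return 16 * 4 * FLOAT4_BYTES / LINE_BYTES;
}
```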
After the loop has completed, the work-group additionally has to load and
store to C, which needs access to a 32 × 4 block of C, spanning
32 complete cache lines, meaning that (as long as our L1 cache is at least 32
cache lines large) we will not see any lack of reuse for the elements of C within a
work-group.
If we continue to assume that we only have a single work-group at a time, and
consider the possibilities for cache reuse between consecutively scheduled work-
groups on the same core, we need to consider the state of the L1 cache when
work-group (n, m) finishes execution. The L1 cache contains 256 cache lines in
total, and the operations on C will have filled 32 of those, so 224 remain for A
and B. Each sequence of four iterations needs 48 cache lines, so the number of
iterations that have their cache lines still in cache at work-group completion is
4 · ⌊(256 − 32)/48⌋ = 16, and this lets us see that the work-group size where we
may have reuse between work-groups is when we only need 16 sequences of four
iterations each in the loop, or 64 iterations, which corresponds to a matrix size
of 256 × 256 (as we chose ΔK_reg = 4). For larger matrices, no reuse between
consecutively scheduled work-groups on the same core is possible.
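The bound derived above can be checked numerically. This sketch assumes the 256-line L1, the 32 lines occupied by C, and 48 lines per group of four iterations, as in the text; the helper name is illustrative:

```c
#include <assert.h>

/* Number of loop iterations whose cache lines can still be resident when a
 * work-group finishes: an L1 of l1_lines lines, minus the 32 lines holding
 * C, holds whole groups of four iterations at 48 lines per group. */
int cached_iterations(int l1_lines) {
    int lines_for_ab = l1_lines - 32;      /* lines not occupied by C  */
    int sequences    = lines_for_ab / 48;  /* whole groups of four     */
    return 4 * sequences;
}
```

With the 256-line cache of the text this evaluates to 16 iterations, and 64 loop iterations with ΔK_reg = 4 correspond to K = 256, matching the 256 × 256 matrix size quoted above.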
Arbitrary local work size. For an arbitrary local work size, we have two reasons to
redo the above analysis. First, we have the obvious reason that we get a different
interleaved read pattern between the work-items within a work-group. Second,
we can have more than one work-group simultaneously active on the same core,
if we choose a smaller local work size.
With a local work size of (λ0, λ1), we need to look at all work-groups that are
running simultaneously on the same core. If the total number of work-items on the