core is 128 (which seemed optimal in the Sobel study), and if $\lambda_0\lambda_1 = 128$, then we have only a single work-group on the core, but we could have chosen $\lambda_0 = \lambda_1 = 4$, which would give us $128/16 = 8$ work-groups executing simultaneously on a core.
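To make this arithmetic concrete, the number of simultaneously active work-groups per core follows directly from the 128 work-items per core; a minimal Python sketch (the function name is ours, not from the text):

    # Work-groups that fit on one core, assuming the 128 simultaneous
    # work-items per core from the Sobel study above.
    WORK_ITEMS_PER_CORE = 128

    def work_groups_per_core(lam0, lam1):
        """Number of (lam0 x lam1) work-groups active at once on a core."""
        assert WORK_ITEMS_PER_CORE % (lam0 * lam1) == 0
        return WORK_ITEMS_PER_CORE // (lam0 * lam1)

    print(work_groups_per_core(128, 1))  # 1: a single work-group fills the core
    print(work_groups_per_core(4, 4))    # 8: the 128/16 = 8 example above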
As before, it will be beneficial to look at the cache usage over four iterations over $k$, and we can easily generalize the results we had before to see that a single work-group reads $\lambda_1$ full cache lines from $A$ and $4\lambda_0$ full cache lines from $B$ for every four iterations (provided that $\lambda_0$ is a multiple of 4).
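These per-work-group read counts are easy to tabulate; a minimal sketch of the bookkeeping (the helper name is ours, the counts are the ones just derived):

    def cache_lines_per_work_group(lam0, lam1):
        """Full cache lines one work-group reads over four iterations in k."""
        assert lam0 % 4 == 0, "the analysis assumes lam0 is a multiple of 4"
        lines_from_A = lam1        # lam1 full lines from A
        lines_from_B = 4 * lam0    # 4*lam0 full lines from B
        return lines_from_A, lines_from_B

    print(cache_lines_per_work_group(4, 4))  # (4, 16)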
If $\lambda_0\lambda_1 \leq 64$, we have more than one work-group executing simultaneously on the core. In this case, the work-groups that are simultaneously active on a core will have consecutive values of $m$ and identical values of $n$. We see that the reads from $A$ read from the same cache lines, so they are reused between the work-groups. We said above that a few work-groups are sent to each core, and we assume that the work-groups that are active at the same time belong to this set, as we would otherwise not have consecutive group IDs $(m, n)$.^18
This means that the 128 work-items executing simultaneously on one core use
$$\lambda_1 + 4\lambda_0 \cdot [\text{number of work-groups}] = \lambda_1 + 4\lambda_0 \cdot \frac{128}{\lambda_0\lambda_1} = \lambda_1 + 512/\lambda_1$$
cache lines from the L1 cache for four consecutive iterations in $k$. As this expression is independent of $\lambda_0$, we can select our $\lambda_0$ freely (as long as it is a multiple of 4), and the only effect we see (from our analysis so far) is that a larger $\lambda_0$ restricts our possible choices for $\lambda_1$.
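The $\lambda_0$-independence can also be checked numerically; a small Python sketch under the same assumptions ($A$-lines shared between the work-groups on a core, $B$-lines private to each work-group), with names of our choosing:

    def l1_lines_per_core(lam0, lam1):
        """Cache lines used on one core per four iterations in k."""
        groups = 128 // (lam0 * lam1)    # work-groups active on the core
        return lam1 + 4 * lam0 * groups  # A-lines shared, B-lines private

    # The total is the same for any valid lam0 at fixed lam1 = 8:
    print([l1_lines_per_core(lam0, 8) for lam0 in (4, 8, 16)])  # [72, 72, 72]
    print(8 + 512 // 8)  # simplified form lam1 + 512/lam1, also 72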
With $\lambda_1 = 1, 2, 4, 8, 16, 32, 64, 128$, we see that we require $513, 258, 132, 72, 48, 48, 72, 132$ cache lines, and with room for 256 lines in the L1 cache of each core, the fraction of L1 we need to use is^19 $2.0, 1.0, 0.52, 0.28, 0.19, 0.19, 0.28, 0.52$. Under our assumptions of a fully associative cache and perfect execution order between work-items, we would expect all options with a value below 1 to have the same performance (disregarding the effects of L2 and RAM).
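These line counts and fractions can be reproduced from the simplified expression; a short Python sketch, assuming the 256-line L1 stated above:

    L1_CAPACITY = 256  # cache lines in each core's L1 (from the text)

    for lam1 in (1, 2, 4, 8, 16, 32, 64, 128):
        lines = lam1 + 512 // lam1  # cache lines per four iterations in k
        print(lam1, lines, round(lines / L1_CAPACITY, 2))
    # lines:     513, 258, 132, 72, 48, 48, 72, 132
    # fractions: 2.0, 1.01, 0.52, 0.28, 0.19, 0.19, 0.28, 0.52
    # (the text rounds 258/256 to 1.0)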
As we know that our assumptions are incorrect, though, we need to discuss what happens when executing on a real GPU. First, due to the design of the cache, we will see cache misses before the cache is 100% filled, i.e., earlier than our analysis above would have predicted. The more complicated aspect of execution is that the work-items that are spawned in the order we describe here do not keep that order. When one work-item is stalled on a cache miss, other work-items may overtake it, so we will have active work-items that are executing different iterations (different values of $k$) at the same time. We refer to this as thread divergence (or work-item divergence), and the fraction of L1 we need is a measure of how robustly we keep good performance in cases of thread divergence. Thread divergence always happens and is difficult to measure and quantify, but
^18 With work-group divergence, i.e., with a few work-items each from many work-groups partially finished on the same core, we might have work-groups with very different group_ids simultaneously active on the same core.
^19 The numbers are also shown in Table 7.2.