core is 128 (which seemed optimal in the Sobel study), and if λ₀λ₁ = 128, then we have only a single work-group on the core, but we could have chosen λ₀ = λ₁ = 4, which would give us 128/16 = 8 work-groups executing simultaneously on a core.
As before, it will be beneficial to look at the cache usage over four iterations of k, and we can easily generalize the results we had before to see that a single work-group reads λ₁ full cache lines from A and 4λ₀ full cache lines from B for every four iterations (provided that λ₀ is a multiple of 4).
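To make this arithmetic concrete, the following is a minimal sketch in C (ours, not the book's code; the constant WORK_ITEMS_PER_CORE and the variable names are purely illustrative) that evaluates the work-group count per core and the per-work-group cache-line reads over four iterations of k under the assumptions above:

/* A minimal sketch (not from the text): evaluates the work-group arithmetic
   above for a lambda0 x lambda1 work-group on a core running 128 work-items,
   assuming lambda0 is a multiple of 4. */
#include <stdio.h>

#define WORK_ITEMS_PER_CORE 128       /* illustrative constant from the text */

int main(void)
{
    int lambda0 = 4, lambda1 = 4;     /* work-group size lambda0 x lambda1 */

    int groups_per_core = WORK_ITEMS_PER_CORE / (lambda0 * lambda1);
    int lines_from_A    = lambda1;        /* cache lines read from A per group */
    int lines_from_B    = 4 * lambda0;    /* cache lines read from B per group */

    printf("work-groups per core: %d\n", groups_per_core);   /* 128/16 = 8 */
    printf("cache lines per work-group over four iterations of k: A = %d, B = %d\n",
           lines_from_A, lines_from_B);
    return 0;
}

For λ₀ = λ₁ = 4 this prints 8 work-groups per core, each reading 4 cache lines from A and 16 from B, matching the counts derived above.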
If λ₀λ₁ ≤ 64, we have more than one work-group executing simultaneously
on the core. In this case, the work-groups that are simultaneously active on a core will have consecutive values of m and identical values of n. We see that the reads from A read from the same cache lines, so they are reused between the work-groups. We said above that a few work-groups are sent to each core, and we assume that the work-groups that are active at the same time belong to this set, as we would otherwise not have consecutive group IDs (m, n).¹⁸
This means that the 128 work-items executing simultaneously on one core use

λ₁ + 4λ₀ · [number of work-groups] = λ₁ + 4λ₀ · 128 / (λ₀λ₁) = λ₁ + 512 / λ₁
cache lines from the L1 cache for four consecutive iterations in k. As this expression is independent of λ₀, we can select our λ₀ freely (as long as it is a multiple of 4), and the only effect we see (from our analysis so far) is that a larger λ₀ restricts our possible choices for λ₁. With λ₁ = 1, 2, 4, 8, 16, 32, 64, 128, we see that we require 513, 258, 132, 72, 48, 48, 72, 132 cache lines, and with room for 256 lines in the L1 cache of each core, the fraction of L1 we need to use is¹⁹ 2.0, 1.0, 0.52, 0.28, 0.19, 0.19, 0.28, 0.52. Under our assumptions of a fully associative cache and perfect execution order between work-items, we would expect all options with a value below 1 to have the same performance (disregarding the effects of L2 and RAM).
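As a quick check on these numbers, the following short sketch (ours, not part of the text; L1_CACHE_LINES simply encodes the 256-line L1 assumed above) evaluates λ₁ + 512/λ₁ and the corresponding fraction of L1 for each power-of-two λ₁:

/* A sketch (not from the text): evaluates lambda1 + 512/lambda1 and the
   resulting fraction of a 256-line L1 cache for every power-of-two lambda1,
   reproducing the cache-line counts 513, 258, 132, 72, 48, 48, 72, 132. */
#include <stdio.h>

#define L1_CACHE_LINES 256    /* room for 256 lines in each core's L1, as above */

int main(void)
{
    for (int lambda1 = 1; lambda1 <= 128; lambda1 *= 2) {
        /* lambda1 + 4*lambda0 * 128/(lambda0*lambda1) = lambda1 + 512/lambda1 */
        int lines = lambda1 + 512 / lambda1;
        double fraction = (double)lines / L1_CACHE_LINES;
        printf("lambda1 = %3d: %3d cache lines, fraction of L1 = %.2f\n",
               lambda1, lines, fraction);
    }
    return 0;
}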
As we know that our assumptions are incorrect, though, we need to discuss
what happens when executing on a real GPU. First, due to the design of the
cache, we will see cache misses before the cache is 100% filled, i.e., earlier than our
analysis above would have predicted. The more complicated aspect of execution
is that the work-items that are spawned in the order we describe here do not
keep the order. When one work-item is stalled on a cache miss, other work-items
may overtake it, so we will have active work-items that are executing different
iterations (different values of k) at the same time. We refer to this as thread divergence (or work-item divergence), and the fraction of L1 we need is a measure of how robustly we keep good performance in cases of thread divergence. Thread divergence always happens and is difficult to measure and quantify, but
¹⁸ With work-group divergence, i.e., with a few work-items each from many work-groups partially finished on the same core, we might have work-groups with very different group_ids simultaneously active on the same core.
¹⁹ The numbers are also shown in Table 7.2.