work. Let n be the vector's length, i_pred be the number of predicated instruction steps required to execute the vector to completion, and i_seq be the total number of instruction steps that would be required to execute the element's shaders sequentially, one at a time, as though on a single processor. Then the utilization of this vector's execution (u_vec) is the ratio of useful work done (i_seq) to the number of slots available for that work (n · i_pred):

    u_vec = i_seq / (n · i_pred).    (38.3)

In example B, the nondiverging case,

    u_vec = (6 + 6 + 6 + 6) / (4 × 6) = 1.0,    (38.4)

which is the maximum possible value, indicating full utilization. In example C, the diverging case,

    u_vec = (8 + 6 + 8 + 8) / (4 × 10) = 0.75,    (38.5)

indicating partial utilization. Predication ensures that an operation is executed for at least one element during each cycle, so the worst possible utilization for an n-wide vector core is 1/n. Minimum utilization is achieved in the switch-statement situation described above; it is approached asymptotically when a single element executes a path that is much longer than the paths executed by the other elements.
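The asymptotic approach to the 1/n floor is easy to check numerically. Suppose one element takes a path of length L while the other n − 1 elements take a short path of length s; with predication every element steps through both paths, spending L + s steps, so u_vec = (L + (n − 1)s) / (n(L + s)), which tends to 1/n as L grows. A minimal sketch (the path lengths here are illustrative, not from the text):

```python
# Utilization when one element runs a long path (length L) and the other
# n - 1 elements run a short path (length s). Predication executes both
# paths, so every element occupies L + s instruction steps.
def long_path_utilization(n, s, L):
    useful = L + (n - 1) * s      # i_seq: total sequential steps
    slots = n * (L + s)           # n * i_pred: available slots
    return useful / slots

for L in (10, 100, 10_000):
    print(L, round(long_path_utilization(4, 2, L), 4))
```

For n = 4 the printed values fall from 1/3 toward the floor of 1/4 as L increases.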
Utilization directly scales performance—the 0.75 utilization achieved in
example C corresponds to 75% of peak performance, or 33% additional running
time (when aggregated across many elements). Because poor utilization is the
direct result of divergence, it is useful to understand the likelihood of divergence,
perhaps as a first step to minimizing it.
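Equation (38.3) and the two worked examples can be verified in a few lines of Python, using the per-element step counts from Equations (38.4) and (38.5):

```python
def u_vec(seq_steps, pred_steps):
    """Utilization per Equation (38.3): useful work (sum of per-element
    sequential steps, i_seq) over available slots (n * i_pred)."""
    n = len(seq_steps)
    return sum(seq_steps) / (n * pred_steps)

print(u_vec([6, 6, 6, 6], 6))    # example B, nondiverging: 1.0
print(u_vec([8, 6, 8, 8], 10))   # example C, diverging: 0.75
```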
Again consider a shader with a single conditional branch. Let p be the branch's probability of taking the yes path, and 1 − p be its probability of taking the no path. Then, if p is evaluated independently for each element, that is, if evaluations of p had no locality, then divergence outcome probabilities for an n-wide vector core are:

    p^n                     no divergence, all yes outcomes
    (1 − p)^n               no divergence, all no outcomes      (38.6)
    1 − p^n − (1 − p)^n     divergence, various utilizations.

Unless p is either very near to zero or very near to one, the probability of diverging increases rapidly as vector length n increases. For example, the probability of divergence with p = 0.1 is 34% for n = 4, but it increases to 81% for n = 16, 97% for n = 32, and 99.9% for n = 64. Even with p = 0.01, a seemingly low probability, divergence occurs almost half the time (47%) for a vector length of 64. These odds might dissuade GPU architects from implementing wide vector units if they were correct, but in general they are not.
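The quoted figures follow directly from the last line of Equation (38.6); a quick numerical check, assuming independent per-element evaluations as the equation does:

```python
def p_diverge(p, n):
    """Probability that an n-wide vector diverges on a branch taken with
    probability p: 1 minus the two no-divergence cases of Eq. (38.6)."""
    return 1 - p**n - (1 - p)**n

for n in (4, 16, 32, 64):
    print(n, round(p_diverge(0.1, n), 3))   # 34%, 81%, 97%, 99.9%
print(64, round(p_diverge(0.01, 64), 3))    # about 47%
```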
In fact, evaluations of p are not independent: they tend to cluster into yes groups and no groups. Temporal locality predicts this: Clusters of repeated references suggest that the same code branch is executed repeatedly. The geometric nature of computer graphics often strengthens the effect. Consider the typical case of a predicate p that is true in shadow and false otherwise. Some triangles will be