work. Let n be the vector's length, i_pred be the number of predicated instruction
steps required to execute the vector to completion, and i_seq be the total number of
instruction steps that would be required to execute the element's shaders sequen-
tially, one at a time, as though on a single processor. Then the utilization of this
vector's execution (u_vec) is the ratio of useful work done (i_seq) to the number of
slots available for that work (n · i_pred):

\[ u_{\mathrm{vec}} = \frac{i_{\mathrm{seq}}}{n \cdot i_{\mathrm{pred}}}. \tag{38.3} \]
In example B, the nondiverging case,

\[ u_{\mathrm{vec}} = \frac{6 + 6 + 6 + 6}{4 \times 6} = 1.0, \tag{38.4} \]
which is the maximum possible value, indicating full utilization. In example C,
the diverging case,

\[ u_{\mathrm{vec}} = \frac{8 + 6 + 8 + 8}{4 \times 10} = 0.75, \tag{38.5} \]
indicating partial utilization. Predication ensures that an operation is executed for
at least one element during each cycle, so the worst possible utilization for an n -
wide vector core is 1/n. Minimum utilization is achieved in the switch-statement
situation described above; it is approached asymptotically when a single element
executes a path that is much longer than the paths executed by the other elements.
Utilization directly scales performance—the 0.75 utilization achieved in
example C corresponds to 75% of peak performance, or 33% additional running
time (when aggregated across many elements). Because poor utilization is the
direct result of divergence, it is useful to understand the likelihood of divergence,
perhaps as a first step to minimizing it.
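The utilization ratio of Equation (38.3) and the two worked examples can be checked with a short sketch. The function name and argument layout here are illustrative choices, not from the text:

```python
def vector_utilization(i_seq_per_element, i_pred):
    """u_vec = total useful work / available slots = sum(i_seq) / (n * i_pred),
    following Equation (38.3)."""
    n = len(i_seq_per_element)
    return sum(i_seq_per_element) / (n * i_pred)

# Example B (nondiverging): four elements, 6 instruction steps each,
# executed in 6 predicated steps.
print(vector_utilization([6, 6, 6, 6], 6))    # 1.0

# Example C (diverging): 10 predicated steps cover both branch paths,
# but each element only needs 6 or 8 of them.
print(vector_utilization([8, 6, 8, 8], 10))   # 0.75
```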
Again consider a shader with a single conditional branch. Let p be the branch's
probability of taking the yes path, and 1 − p be its probability of taking the no path.
Then, if p is evaluated independently for each element, that is, if evaluations of p
have no locality, then the divergence outcome probabilities for an n-wide vector core
are:
\[
\begin{aligned}
p^n & \quad \text{no divergence, all yes outcomes} \\
(1 - p)^n & \quad \text{no divergence, all no outcomes} \\
1 - p^n - (1 - p)^n & \quad \text{divergence, various utilizations.}
\end{aligned} \tag{38.6}
\]
Unless p is either very near to zero or very near to one, the probability of diverging
increases rapidly as vector length n increases. For example, the probability of
divergence with p = 0.1 is 34% for n = 4, but it increases to 81% for n = 16,
97% for n = 32, and 99.9% for n = 64. Even with p = 0.01, a seemingly low
probability, divergence occurs almost half the time (47%) for a vector length of
64. These odds might dissuade GPU architects from implementing wide vector
units if they were correct, but in general they are not.
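Under the independence assumption, the divergence term of Equation (38.6) can be evaluated directly; a minimal sketch (the function name is my own) reproduces the percentages quoted above:

```python
def divergence_probability(p, n):
    """Probability that an n-wide vector diverges at a branch whose
    yes path is taken with probability p, assuming outcomes are
    independent across elements (Equation 38.6)."""
    return 1.0 - p**n - (1.0 - p)**n

for n in (4, 16, 32, 64):
    # p = 0.1: roughly 34%, 81%, 97%, 99.9%
    print(n, round(divergence_probability(0.1, n), 3))

# Even p = 0.01 diverges nearly half the time at n = 64 (~47%).
print(round(divergence_probability(0.01, 64), 2))
```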
In fact, evaluations of p are not independent—they tend to cluster into yes
groups and no groups. Temporal locality predicts this: Clusters of repeated ref-
erences suggest that the same code branch is executed repeatedly. The geometric
nature of computer graphics often strengthens the effect. Consider the typical case
of a predicate p that is true in shadow and false otherwise. Some triangles will be