Graphics Reference
In-Depth Information
[Figure 38.9 panels: A: Shader program; B: Nondiverging execution; C: Diverging execution]
Figure 38.9: Diverging and nondiverging execution on a four-element predicated vector
core. Each element executes the ten-operation shader A that branches on predicate p. In
case B, all four elements take the no branch, there is no divergence, and only six
execution steps are required. In case C, element one takes the no branch, but the other three
elements take the yes branch. Predication handles this divergence by executing the no and
yes operations separately, so all ten execution steps are required.
element executes a completely different program section (e.g., the shader is
equivalent to a C++ switch statement with selection driven by unique element indices),
divergence is complete and all parallelism is lost. More typically, elements share
portions of code, which the predicated SIMD core executes in parallel, so
parallelism is merely reduced.
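The two extremes described above can be sketched with a small step counter. This is an illustrative model only, not code from the text: the function name predicated_steps, the selector lists, and the choice of four three-operation switch cases are all assumptions made for the example. The model assumes a predicated core issues every case that at least one element selects, once, with the remaining elements masked off.

```python
def predicated_steps(selectors, case_lengths):
    """Count issue steps on a predicated vector core (illustrative model).

    selectors[i] is the switch case chosen by element i (hypothetical input).
    case_lengths[k] is the number of operations in case k.
    Each case selected by at least one element is issued once in full,
    with the elements that did not select it masked off.
    """
    taken = set(selectors)
    return sum(case_lengths[k] for k in taken)


# Assumed workload: a switch with four cases of three operations each.
case_lengths = {0: 3, 1: 3, 2: 3, 3: 3}

uniform = predicated_steps([2, 2, 2, 2], case_lengths)   # all elements agree
complete = predicated_steps([0, 1, 2, 3], case_lengths)  # every element differs
serial = 4 * 3                                           # one element at a time

print(uniform, complete, serial)
```

Under these assumptions, the fully divergent vector needs as many steps as running the four elements one after another, which is the sense in which all parallelism is lost.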
Figure 38.9 illustrates such a typical situation, using a simplified four-element
vector core. Shader A includes a single conditional branch that selects either the
two-operation no path or the four-operation yes path. Another four operations, two
ahead of the branch and two after, are common to both paths: They are executed
regardless of which path is taken. In the first example, B, all four elements take the
no branch, so there is no divergence. Predication handles this case by executing
each common and no-path operation in parallel across the four-wide vector. Because
no elements require the yes-path operations, they are never executed and no cycles
are lost to them. Thus, the entire shader executes in only six cycles.
In the second example, C, element one takes the no branch, but the other elements
take the yes branch. Because both branches are taken, execution diverges.
Predication handles this divergence by first executing each no operation on the
single element that requires it, then executing each yes operation in parallel across
the three elements that require it. Because both no and yes operations are executed,
shader execution requires the full ten cycles to complete.
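The two examples can be traced with a short simulation. This is a sketch under stated assumptions rather than a description of any real hardware: the operation names (c0, c1, n0, n1, y0 to y3, c2, c3) follow the labels in Figure 38.9, the function name schedule and the mask representation are invented for illustration, and "element one" is assumed to mean the element labeled 1 when elements are numbered from 0.

```python
# Operation labels as they appear in Figure 38.9.
COMMON_HEAD = ["c0", "c1"]            # executed before the branch
NO_PATH = ["n0", "n1"]                # two-operation no path
YES_PATH = ["y0", "y1", "y2", "y3"]   # four-operation yes path
COMMON_TAIL = ["c2", "c3"]            # executed after the branch


def schedule(takes_yes):
    """Return the (operation, mask) steps issued for one vector.

    takes_yes[i] is the value of predicate p for element i.
    A path is issued only if at least one element needs it.
    """
    all_on = [True] * len(takes_yes)
    no_mask = [not p for p in takes_yes]
    yes_mask = list(takes_yes)

    steps = [(op, all_on) for op in COMMON_HEAD]
    if any(no_mask):
        steps += [(op, no_mask) for op in NO_PATH]
    if any(yes_mask):
        steps += [(op, yes_mask) for op in YES_PATH]
    steps += [(op, all_on) for op in COMMON_TAIL]
    return steps


case_b = schedule([False, False, False, False])  # all elements take no
case_c = schedule([True, False, True, True])     # only element 1 takes no

# Print the diverging trace: X marks an active element, . a masked one.
for op, mask in case_c:
    print(op, "".join("X" if m else "." for m in mask))

print(len(case_b), len(case_c))
```

In this model case B issues six steps and case C issues ten, matching the cycle counts given in the text; the printed masks show which elements sit idle during the no and yes operations.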
Obviously, divergence reduces the efficiency of computation on a predicated
SIMD core. We quantify this loss by computing utilization: the ratio of the
useful work that is done to the number of operation slots made available for that
 