also have individual per-lane predication (enable/disable), specified with a 1-bit predicate
register for each lane.
The PTX assembler typically optimizes a simple outer-level IF/THEN/ELSE statement coded
with PTX branch instructions to just predicated GPU instructions, without any GPU branch
instructions. A more complex control flow typically results in a mixture of predication and
GPU branch instructions with special instructions and markers that use the branch
synchronization stack to push a stack entry when some lanes branch to the target address, while
others fall through. NVIDIA says a branch diverges when this happens. This mixture is also used
when a SIMD Lane executes a synchronization marker or converges, which pops a stack entry
and branches to the stack-entry address with the stack-entry thread-active mask.
The PTX assembler identifies loop branches and generates GPU branch instructions that
branch to the top of the loop, along with special stack instructions to handle individual lanes
breaking out of the loop and converging the SIMD Lanes when all lanes have completed the
loop. GPU indexed jump and indexed call instructions push entries on the stack so that when
all lanes complete the switch statement or function call the SIMD thread converges.
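The loop behavior described above can be modeled in a few lines of Python. This is a minimal sketch, not GPU code: the function name `run_divergent_loop` and the idea of giving each lane its own trip count are assumptions made for illustration. It shows that the SIMD thread keeps branching back to the top of the loop until the last lane has broken out, and only then converges.

```python
def run_divergent_loop(trip_counts):
    """Model one SIMD thread running a loop whose trip count differs per lane.

    Returns (iterations_issued, useful_lane_ops, total_lane_slots).
    Hypothetical model for illustration only.
    """
    lanes = len(trip_counts)
    active = [n > 0 for n in trip_counts]  # lanes still inside the loop
    done = [0] * lanes                     # iterations completed by each lane
    issued = 0
    useful = 0
    while any(active):                     # branch to the top while any lane remains
        issued += 1
        for i in range(lanes):
            if active[i]:
                done[i] += 1
                useful += 1
                if done[i] == trip_counts[i]:
                    active[i] = False      # this lane breaks out and sits idle
    # Every lane has left the loop, so the SIMD thread converges here.
    return issued, useful, issued * lanes


if __name__ == "__main__":
    print(run_divergent_loop([1, 2, 3, 4]))
```

In this model the loop body is issued as many times as the longest-running lane needs, so lanes that finish early contribute idle slots.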
A GPU set predicate instruction (setp in the figure above) evaluates the conditional part of
the IF statement. The PTX branch instruction then depends on that predicate. If the PTX
assembler generates predicated instructions with no GPU branch instructions, it uses a per-lane
predicate register to enable or disable each SIMD Lane for each instruction. The SIMD
instructions in the threads inside the THEN part of the IF statement broadcast operations to all the
SIMD Lanes. Those lanes with the predicate set to one perform the operation and store the
result, and the other SIMD Lanes don't perform an operation or store a result. For the ELSE
statement, the instructions use the complement of the predicate (relative to the THEN statement),
so the SIMD Lanes that were idle now perform the operation and store the result while their
formerly active siblings don't. At the end of the ELSE statement, the instructions are
unpredicated so the original computation can proceed. Thus, for equal-length paths, an IF-THEN-ELSE
operates at 50% efficiency.
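To make the predication scheme concrete, here is a small Python model of a predicated IF-THEN-ELSE. It is a sketch under stated assumptions: each path is a single instruction, and the names `predicated_if_then_else`, `then_op`, and `else_op` are hypothetical, not part of PTX or any GPU toolchain. Each instruction is broadcast to every lane, and only the lanes enabled by the predicate (or its complement) store a result.

```python
def predicated_if_then_else(x, cond, then_op, else_op):
    """Model a predicated IF-THEN-ELSE over the lanes of one SIMD thread.

    Returns (results, efficiency), where efficiency is the fraction of
    lane slots that did useful work. Hypothetical model for illustration.
    """
    pred = [cond(v) for v in x]  # analogue of setp: one predicate bit per lane
    out = list(x)
    slots = 0
    useful = 0
    # THEN instruction: broadcast to all lanes, enabled where pred is 1.
    for i, v in enumerate(x):
        slots += 1
        if pred[i]:
            out[i] = then_op(v)
            useful += 1
    # ELSE instruction: same broadcast, enabled by the complement of pred.
    for i, v in enumerate(x):
        slots += 1
        if not pred[i]:
            out[i] = else_op(v)
            useful += 1
    return out, (useful / slots if slots else 0.0)


if __name__ == "__main__":
    values = [-2, 3, -5, 7]
    print(predicated_if_then_else(values, lambda v: v < 0,
                                  lambda v: -v, lambda v: v * 2))
```

Because every lane is enabled in exactly one of the two equal-length paths, half of the lane slots are idle regardless of how the predicate splits the lanes.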
IF statements can be nested, hence the use of a stack, and the PTX assembler typically
generates a mix of predicated instructions and GPU branch and special synchronization instructions
for complex control flow. Note that deep nesting can mean that most SIMD Lanes are idle
during execution of nested conditional statements. Thus, doubly nested IF statements with
equal-length paths run at 25% efficiency, triply nested at 12.5% efficiency, and so on. The analogous
case would be a vector processor operating where only a few of the mask bits are ones.
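The efficiency figures above can be reproduced with a short Python model. This is a sketch that assumes the lanes split evenly at every nesting level and counts only the innermost, equal-length paths; the function name `nested_if_efficiency` and the default of 32 lanes are choices made for the example.

```python
def nested_if_efficiency(depth, lanes=32):
    """Fraction of lane slots doing useful work in the innermost paths of
    `depth` nested IF-THEN-ELSE statements with equal-length paths.

    Assumes lanes split evenly at every level (hypothetical model).
    """
    paths = 2 ** depth                     # number of innermost paths
    useful = 0
    slots = 0
    for path in range(paths):              # paths are executed one after another
        active = [lane % paths == path for lane in range(lanes)]
        useful += sum(active)              # lanes enabled on this path
        slots += lanes                     # every lane occupies a slot
    return useful / slots


if __name__ == "__main__":
    for d in range(4):
        print(d, nested_if_efficiency(d))
```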
Dropping down a level of detail, the PTX assembler sets a “branch synchronization” marker
on appropriate conditional branch instructions that pushes the current active mask on a stack
inside each SIMD thread. If the conditional branch diverges (some lanes take the branch,
some fall through), it pushes a stack entry and sets the current internal active mask based on
the condition. A branch synchronization marker pops the diverged branch entry and flips the
mask bits before the ELSE portion. At the end of the IF statement, the PTX assembler adds
another branch synchronization marker that pops the prior active mask off the stack into the
current active mask.
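The sequence of pushes and pops just described can be traced with integer bit masks. The following Python sketch models only the mask bookkeeping for one divergent IF-THEN-ELSE. The function name `if_then_else_masks`, the 8-lane width, and the convention that a predicate bit of one selects the THEN part are assumptions for illustration; the real hardware stack also holds target addresses, which are omitted here.

```python
def if_then_else_masks(active, pred, width=8):
    """Trace the active mask through a divergent IF-THEN-ELSE.

    `active` and `pred` are bit masks with one bit per lane.
    Returns (mask_in_then, mask_in_else, mask_after_if).
    Hypothetical model for illustration only.
    """
    full = (1 << width) - 1
    stack = []
    current = active

    stack.append(current)                  # marker on the branch: save the active mask
    then_mask = current & pred
    else_mask = current & ~pred & full
    stack.append(else_mask)                # divergence: entry for the lanes that go to ELSE
    current = then_mask                    # THEN runs with the condition-true lanes
    mask_in_then = current

    current = stack.pop()                  # marker before ELSE: pop entry, mask bits flipped
    mask_in_else = current

    current = stack.pop()                  # marker at end of IF: restore the prior mask
    return mask_in_then, mask_in_else, current


if __name__ == "__main__":
    t, e, after = if_then_else_masks(0b00111100, 0b00001111)
    print(f"{t:08b} {e:08b} {after:08b}")
```

Because the saved mask sits below the divergence entry, the same bookkeeping nests naturally: an inner IF pushes and pops its own entries before the outer entries are touched.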
If all the mask bits are set to one, then the branch instruction at the end of the THEN skips
over the instructions in the ELSE part. There is a similar optimization for the THEN part in
case all the mask bits are zero, as the conditional branch jumps over the THEN instructions.
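The effect of these two optimizations on the number of issued instructions can be sketched as follows. This Python model is an assumption-laden illustration: `issued_instructions`, `then_len`, and `else_len` are hypothetical names, and the model ignores the branch and marker instructions themselves.

```python
def issued_instructions(active, pred, then_len, else_len):
    """Count the body instructions a SIMD thread issues for one IF-THEN-ELSE.

    `active` and `pred` are bit masks with one bit per lane; a path is
    skipped entirely when no active lane needs it. Hypothetical model.
    """
    then_mask = active & pred              # lanes that need the THEN part
    else_mask = active & ~pred             # lanes that need the ELSE part
    issued = 0
    if then_mask:                          # all-zero mask: branch jumps over THEN
        issued += then_len
    if else_mask:                          # all lanes took THEN: branch skips ELSE
        issued += else_len
    return issued


if __name__ == "__main__":
    print(issued_instructions(0xFF, 0xFF, 10, 10))  # unanimous: THEN only
    print(issued_instructions(0xFF, 0x00, 10, 10))  # unanimous: ELSE only
    print(issued_instructions(0xFF, 0x0F, 10, 10))  # divergent: both paths
```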
Parallel IF statements and PTX branches often use branch conditions that are unanimous (all
lanes agree to follow the same path), such that the SIMD thread does not diverge into different
individual lane control flow. The PTX assembler optimizes such branches to skip over blocks