Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach

also have individual per-lane predication (enable/disable), specified with a 1-bit predicate register for each lane.

The PTX assembler typically optimizes a simple outer-level IF/THEN/ELSE statement coded with PTX branch instructions to just predicated GPU instructions, without any GPU branch instructions. A more complex control flow typically results in a mixture of predication and GPU branch instructions with special instructions and markers that use the branch synchronization stack to push a stack entry when some lanes branch to the target address, while others fall through. NVIDIA says a branch diverges when this happens. This mixture is also used when a SIMD Lane executes a synchronization marker or converges, which pops a stack entry and branches to the stack-entry address with the stack-entry thread-active mask.

The PTX assembler identifies loop branches and generates GPU branch instructions that branch to the top of the loop, along with special stack instructions to handle individual lanes breaking out of the loop and converging the SIMD Lanes when all lanes have completed the loop. GPU indexed jump and indexed call instructions push entries on the stack so that when all lanes complete the switch statement or function call, the SIMD thread converges.
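To make the cost of lanes leaving a loop at different times concrete, here is a minimal Python sketch. It is a toy model, not PTX or real hardware behavior: the function name `run_divergent_loop`, the per-lane trip counts, and the slot counting are all assumptions introduced for illustration. The SIMD thread keeps iterating until every lane has exited, and lanes that exit early simply idle.

```python
def run_divergent_loop(trip_counts):
    """Toy model of one SIMD thread running a loop with per-lane trip counts.

    Returns (iterations issued, fraction of lane slots doing useful work).
    """
    remaining = list(trip_counts)
    # Active mask: a lane stays active while it still has iterations left.
    active = [n > 0 for n in remaining]
    iterations = 0
    useful = 0

    # The thread branches back to the loop top while any lane is active.
    while any(active):
        iterations += 1
        for lane in range(len(remaining)):
            if active[lane]:
                useful += 1
                remaining[lane] -= 1
                if remaining[lane] == 0:
                    # This lane breaks out of the loop and idles until
                    # the remaining lanes finish and the thread converges.
                    active[lane] = False

    total = iterations * len(trip_counts)
    efficiency = useful / total if total else 1.0
    return iterations, efficiency


if __name__ == "__main__":
    print(run_divergent_loop([1, 2, 4, 4]))
```

In this model the loop runs as long as its slowest lane, so a single long-running lane lowers the efficiency of the whole SIMD thread.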

A GPU set predicate instruction (setp in the figure above) evaluates the conditional part of the IF statement. The PTX branch instruction then depends on that predicate. If the PTX assembler generates predicated instructions with no GPU branch instructions, it uses a per-lane predicate register to enable or disable each SIMD Lane for each instruction. The SIMD instructions in the threads inside the THEN part of the IF statement broadcast operations to all the SIMD Lanes. Those lanes with the predicate set to one perform the operation and store the result, and the other SIMD Lanes don't perform an operation or store a result. For the ELSE statement, the instructions use the complement of the predicate (relative to the THEN statement), so the SIMD Lanes that were idle now perform the operation and store the result while their formerly active siblings don't. At the end of the ELSE statement, the instructions are unpredicated so the original computation can proceed. Thus, for equal-length paths, an IF-THEN-ELSE operates at 50% efficiency.
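The predication scheme described above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not PTX: the function name `predicated_if_then_else`, the condition `v >= 0`, and the doubling and negation operations are hypothetical choices made only to have something to execute. Each path is a single instruction, so the two paths have equal length.

```python
def predicated_if_then_else(x):
    """Toy model of a predicated IF-THEN-ELSE across SIMD lanes.

    Returns (per-lane results, fraction of lane slots doing useful work).
    """
    # setp-like step: one predicate bit per lane (the condition is assumed).
    pred = [v >= 0 for v in x]
    out = list(x)
    useful_slots = 0
    total_slots = 0

    # THEN instruction: broadcast to every lane; only lanes whose
    # predicate is 1 perform the operation and store a result.
    for lane, p in enumerate(pred):
        total_slots += 1
        if p:
            out[lane] = x[lane] * 2
            useful_slots += 1

    # ELSE instruction: same broadcast, using the complemented predicate.
    for lane, p in enumerate(pred):
        total_slots += 1
        if not p:
            out[lane] = -x[lane]
            useful_slots += 1

    return out, useful_slots / total_slots


if __name__ == "__main__":
    print(predicated_if_then_else([3, -1, 0, -5]))
```

Because every lane is active in exactly one of the two instructions, the useful fraction in this model is one half regardless of how the lanes split between THEN and ELSE.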

IF statements can be nested, hence the use of a stack, and the PTX assembler typically generates a mix of predicated instructions and GPU branch and special synchronization instructions for complex control flow. Note that deep nesting can mean that most SIMD Lanes are idle during execution of nested conditional statements. Thus, doubly nested IF statements with equal-length paths run at 25% efficiency, triply nested at 12.5% efficiency, and so on. The analogous case would be a vector processor operating where only a few of the mask bits are ones.
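The 25% and 12.5% figures follow from counting lane slots. The Python sketch below is a simplified model under assumptions that are not in the text: all useful work sits in the innermost blocks, each innermost block is one instruction, every leaf path is issued serially, and lane `i` picks its path from the low bits of `i` so that every path is populated. The name `nested_if_efficiency` is hypothetical.

```python
from itertools import product


def nested_if_efficiency(depth, lanes=32):
    """Fraction of lane slots doing useful work for `depth` nested IFs.

    Assumes equal-length paths and that lane i follows the path given by
    the low `depth` bits of i (an illustrative choice, not from the text).
    """
    lane_paths = [tuple((i >> b) & 1 for b in range(depth))
                  for i in range(lanes)]
    useful = 0
    total = 0
    # Each of the 2**depth leaf paths is issued once to all lanes;
    # a lane does useful work only on the one path it actually takes.
    for path in product((0, 1), repeat=depth):
        for lane_path in lane_paths:
            total += 1
            if lane_path == path:
                useful += 1
    return useful / total


if __name__ == "__main__":
    for d in (1, 2, 3):
        print(d, nested_if_efficiency(d))
```

Under these assumptions the result is 1/2^depth, which matches the 50%, 25%, and 12.5% sequence in the text.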

Dropping down a level of detail, the PTX assembler sets a “branch synchronization” marker on appropriate conditional branch instructions that pushes the current active mask on a stack inside each SIMD thread. If the conditional branch diverges (some lanes take the branch, some fall through), it pushes a stack entry and sets the current internal active mask based on the condition. A branch synchronization marker pops the diverged branch entry and flips the mask bits before the ELSE portion. At the end of the IF statement, the PTX assembler adds another branch synchronization marker that pops the prior active mask off the stack into the current active mask.
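The push, flip, and pop sequence can be traced with a small Python model. This is a sketch under assumptions, not a description of the actual hardware: the function name `run_if_else`, the tagged stack entries, and the choice to store the complemented mask in the diverged entry (rather than flipping bits at pop time) are all illustrative. When the branch is unanimous, the model pushes no diverged entry and runs only one side.

```python
def run_if_else(active, cond):
    """Toy model of a branch synchronization stack for one IF/ELSE.

    active: current active mask (list of bool, one per lane)
    cond:   per-lane branch condition (list of bool)
    Returns (trace of (block, mask) pairs, active mask after the IF).
    """
    stack = []
    trace = []

    # Marker on the conditional branch: save the current active mask.
    stack.append(("reconverge", list(active)))

    then_mask = [a and c for a, c in zip(active, cond)]
    else_mask = [a and not c for a, c in zip(active, cond)]

    if any(then_mask) and any(else_mask):
        # Divergence: push an entry for the lanes that go the other way,
        # then run THEN with only the lanes whose condition held.
        stack.append(("diverged", else_mask))
        trace.append(("THEN", then_mask))
        # Marker before ELSE: pop the diverged entry (the flipped mask).
        _, current = stack.pop()
        trace.append(("ELSE", current))
    elif any(then_mask):
        # Unanimous: all active lanes take THEN; ELSE is skipped.
        trace.append(("THEN", then_mask))
    else:
        # Unanimous the other way: THEN is skipped.
        trace.append(("ELSE", else_mask))

    # Marker at the end of the IF: restore the prior active mask.
    _, current = stack.pop()
    return trace, current


if __name__ == "__main__":
    print(run_if_else([True] * 4, [True, False, True, False]))
```

In the divergent case the trace shows THEN and ELSE running back to back with complementary masks, and the mask returned at the end equals the one that was active before the IF.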

If all the mask bits are set to one, then the branch instruction at the end of the THEN skips over the instructions in the ELSE part. There is a similar optimization for the THEN part in case all the mask bits are zero, as the conditional branch jumps over the THEN instructions. Parallel IF statements and PTX branches often use branch conditions that are unanimous (all lanes agree to follow the same path), such that the SIMD thread does not diverge into different individual lane control flow. The PTX assembler optimizes such branches to skip over blocks
