of instructions that are not executed by any lane of a SIMD thread. This optimization is useful
in error condition checking, for example, where the test must be made but is rarely taken.
The code for a conditional statement similar to the one in Section 4.2 is
if (X[i] != 0)
   X[i] = X[i] - Y[i];
else X[i] = Z[i];
This IF statement could compile to the following PTX instructions (assuming that R8 already
has the scaled thread ID), with *Push, *Comp, and *Pop indicating the branch synchronization
markers inserted by the PTX assembler that push the old mask, complement the current mask,
and pop to restore the old mask:
ld.global.f64 RD0, [X+R8] ; RD0 = X[i]
setp.neq.s32 P1, RD0, #0 ; P1 is predicate register 1
@!P1, bra ELSE1, *Push ; Push old mask, set new mask bits
; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2 ; Difference in RD0
st.global.f64 [X+R8], RD0 ; X[i] = RD0
@P1, bra ENDIF1, *Comp ; complement mask bits
; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: <next instruction>, *Pop ; pop to restore old mask
Once again, normally all instructions in the IF-THEN-ELSE statement are executed by a
SIMD Processor. It's just that only some of the SIMD Lanes are enabled for the THEN instruc-
tions and some lanes for the ELSE instructions. As mentioned above, in the surprisingly com-
mon case that the individual lanes agree on the predicated branch—such as branching on a
parameter value that is the same for all lanes so that all active mask bits are zeros or all are
ones—the branch skips the THEN instructions or the ELSE instructions.
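To make the mask bookkeeping concrete, here is a minimal, hypothetical sketch in plain C
(also legal inside a CUDA source file) that mimics what the *Push, *Comp, and *Pop markers
do for the IF-THEN-ELSE above. The lane count, function name, and bitmask representation are
illustrative assumptions, not a description of real GPU hardware or of the CUDA API:
#define NUM_LANES 32                 /* lanes in one thread of SIMD instructions */
void if_then_else_with_mask(double *X, double *Y, double *Z)
{
    unsigned int saved = 0xFFFFFFFFu; /* *Push: save the old (all-active) mask */
    unsigned int thenm = 0;           /* setp.neq: lanes whose X[i] != 0 */
    unsigned int mask;
    int i;
    for (i = 0; i < NUM_LANES; i++)
        if (X[i] != 0.0) thenm |= 1u << i;
    mask = saved & thenm;             /* new mask for the THEN clause */
    if (mask != 0)                    /* hardware skips the clause if all bits are zero */
        for (i = 0; i < NUM_LANES; i++)
            if (mask & (1u << i)) X[i] = X[i] - Y[i];  /* THEN: X[i] = X[i] - Y[i] */
    mask = saved & ~thenm;            /* *Comp: complement the mask bits */
    if (mask != 0)                    /* skip the ELSE clause if no lane needs it */
        for (i = 0; i < NUM_LANES; i++)
            if (mask & (1u << i)) X[i] = Z[i];         /* ELSE: X[i] = Z[i] */
    mask = saved;                     /* *Pop: restore the old mask */
}
In this sketch, as in the hardware, both clauses are traversed whenever lanes disagree; the
all-zero test is what lets the common all-agree case skip one clause entirely.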
This flexibility makes it appear that an element has its own program counter; however, in
the slowest case only one SIMD Lane could store its result every two clock cycles, with the rest
idle. The analogous slowest case for vector architectures is operating with only one mask bit
set to one. This flexibility can lead naive GPU programmers to poor performance, but it can
be helpful in the early stages of program development. Keep in mind, however, that the only
choice for a SIMD Lane in a clock cycle is to perform the operation specified in the PTX in-
struction or be idle; two SIMD Lanes cannot simultaneously execute different instructions.
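As a hedged illustration of that slowest case, consider a deliberately divergent CUDA kernel
in which each lane of a 32-wide thread of SIMD instructions takes a different path, so the
paths are serialized and at most one lane does useful work at a time. The kernel name and
arrays are made up for the example:
__global__ void worst_case_divergence(float *out, const float *in)
{
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;  /* position within the thread of SIMD instructions */
    /* Two SIMD Lanes cannot execute different instructions at the same time,
       so these arms run one after another, each with a single lane enabled
       and the other lanes idle under zero mask bits. */
    switch (lane) {
    case 0:  out[i] = in[i] + 1.0f; break;
    case 1:  out[i] = in[i] * 2.0f; break;
    case 2:  out[i] = in[i] - 3.0f; break;
    /* ... one arm per lane ... */
    default: out[i] = in[i];        break;
    }
}
Launched as, say, worst_case_divergence<<<nBlocks, 256>>>(out, in), this kernel produces
correct results but at roughly 1/32 of the throughput a branch-free formulation would get.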
This flexibility also helps explain the name CUDA Thread given to each element in a thread
of SIMD instructions, since it gives the illusion of acting independently. A naive programmer
may think that this thread abstraction means GPUs handle conditional branches more grace-
fully. Some threads go one way, the rest go another, which seems true as long as you're not
in a hurry. Each CUDA Thread is either executing the same instruction as every other thread in the
thread block or it is idle. This synchronization makes it easier to handle loops with conditional
branches, since the mask capability can turn off SIMD Lanes and the hardware detects the end of
the loop automatically.
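A short, hypothetical CUDA sketch of such a loop may help: each CUDA Thread below iterates a
data-dependent number of times, and lanes whose trip count is exhausted are simply masked off
until every lane in the thread of SIMD instructions has left the loop. The kernel name and
the count array are assumptions for illustration:
__global__ void variable_trip_count(float *x, const int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    /* Lanes with a small count[i] go idle (mask bit zero) while neighboring
       lanes keep iterating; the loop branch falls through only once every
       lane's trip count is exhausted. */
    for (int k = 0; k < count[i]; k++)
        x[i] = 0.5f * x[i] + 1.0f;
}
The program is written as if each element had its own program counter; the mask hardware
supplies that illusion, at whatever cost the spread of trip counts within each thread of SIMD
instructions imposes.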
The resulting performance sometimes belies that simple abstraction. Writing programs that
operate SIMD Lanes in this highly independent MIMD mode is like writing programs that use
lots of virtual address space on a computer with a smaller physical memory. Both are correct,
but they may run so slowly that the programmer could be displeased with the result.