of instructions that are not executed by any lane of a SIMD thread. This optimization is useful
in error condition checking, for example, where the test must be made but is rarely taken.
The code for a conditional statement similar to the one in Section 4.2 is
if (X[i] != 0)
   X[i] = X[i] - Y[i];
else X[i] = Z[i];
This IF statement could compile to the following PTX instructions (assuming that R8 already
has the scaled thread ID), with *Push, *Comp, and *Pop indicating the branch synchronization
markers inserted by the PTX assembler that push the old mask, complement the current mask,
and pop to restore the old mask:
ld.global.f64 RD0, [X+R8] ; RD0 = X[i]
setp.neq.s32 P1, RD0, #0 ; P1 is predicate register 1
@!P1, bra ELSE1, *Push ; Push old mask, set new mask bits
; if P1 false, go to ELSE1
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
sub.f64 RD0, RD0, RD2 ; Difference in RD0
st.global.f64 [X+R8], RD0 ; X[i] = RD0
@P1, bra ENDIF1, *Comp ; complement mask bits
; if P1 true, go to ENDIF1
ELSE1: ld.global.f64 RD0, [Z+R8] ; RD0 = Z[i]
st.global.f64 [X+R8], RD0 ; X[i] = RD0
ENDIF1: <next instruction>, *Pop ; pop to restore old mask
Once again, normally all instructions in the IF-THEN-ELSE statement are executed by a
SIMD Processor. It's just that only some of the SIMD Lanes are enabled for the THEN instruc-
tions and some lanes for the ELSE instructions. As mentioned above, in the surprisingly com-
mon case that the individual lanes agree on the predicated branch—such as branching on a
parameter value that is the same for all lanes so that all active mask bits are zeros or all are
ones—the branch skips the THEN instructions or the ELSE instructions.
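To make the mask bookkeeping concrete, here is a minimal, hypothetical sketch in plain C
(also legal inside a CUDA source file) that mimics what the *Push, *Comp, and *Pop markers
do for the IF-THEN-ELSE above. The lane count, function name, and bitmask representation are
illustrative assumptions, not a description of real GPU hardware or of the CUDA API:
#define NUM_LANES 32                 /* lanes in one thread of SIMD instructions */
void if_then_else_with_mask(double *X, double *Y, double *Z)
{
    unsigned int saved = 0xFFFFFFFFu; /* *Push: save the old (all-active) mask */
    unsigned int thenm = 0;           /* setp.neq: lanes whose X[i] != 0 */
    unsigned int mask;
    int i;
    for (i = 0; i < NUM_LANES; i++)
        if (X[i] != 0.0) thenm |= 1u << i;
    mask = saved & thenm;             /* new mask for the THEN clause */
    if (mask != 0)                    /* hardware skips the clause if all bits are zero */
        for (i = 0; i < NUM_LANES; i++)
            if (mask & (1u << i)) X[i] = X[i] - Y[i];  /* THEN: X[i] = X[i] - Y[i] */
    mask = saved & ~thenm;            /* *Comp: complement the mask bits */
    if (mask != 0)                    /* skip the ELSE clause if no lane needs it */
        for (i = 0; i < NUM_LANES; i++)
            if (mask & (1u << i)) X[i] = Z[i];         /* ELSE: X[i] = Z[i] */
    mask = saved;                     /* *Pop: restore the old mask */
}
In this sketch, as in the hardware, both clauses are traversed whenever lanes disagree; the
all-zero test is what lets the common all-agree case skip one clause entirely.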
This flexibility makes it appear that an element has its own program counter; however, in
the slowest case only one SIMD Lane could store its result every two clock cycles, with the rest
idle. The analogous slowest case for vector architectures is operating with only one mask bit
set to one. This flexibility can lead naive GPU programmers to poor performance, but it can
be helpful in the early stages of program development. Keep in mind, however, that the only
choice for a SIMD Lane in a clock cycle is to perform the operation specified in the PTX in-
struction or be idle; two SIMD Lanes cannot simultaneously execute different instructions.
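As a hedged illustration of that slowest case, consider a deliberately divergent CUDA kernel
in which each lane of a 32-wide thread of SIMD instructions takes a different path, so the
paths are serialized and at most one lane does useful work at a time. The kernel name and
arrays are made up for the example:
__global__ void worst_case_divergence(float *out, const float *in)
{
    int i    = blockIdx.x * blockDim.x + threadIdx.x;
    int lane = threadIdx.x % 32;  /* position within the thread of SIMD instructions */
    /* Two SIMD Lanes cannot execute different instructions at the same time,
       so these arms run one after another, each with a single lane enabled
       and the other lanes idle under zero mask bits. */
    switch (lane) {
    case 0:  out[i] = in[i] + 1.0f; break;
    case 1:  out[i] = in[i] * 2.0f; break;
    case 2:  out[i] = in[i] - 3.0f; break;
    /* ... one arm per lane ... */
    default: out[i] = in[i];        break;
    }
}
Launched as, say, worst_case_divergence<<<nBlocks, 256>>>(out, in), this kernel produces
correct results but at roughly 1/32 of the throughput a branch-free formulation would get.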
This flexibility also helps explain the name CUDA Thread given to each element in a thread
of SIMD instructions, since it gives the illusion of acting independently. A naive programmer
may think that this thread abstraction means GPUs handle conditional branches more grace-
fully. Some threads go one way, the rest go another, which seems true as long as you're not
in a hurry. Each CUDA Thread is either executing the same instruction as every other thread in the
thread block or it is idle. This synchronization makes it easier to handle loops with conditional
branches, since the mask capability can turn off SIMD Lanes and the hardware detects the end of
the loop automatically.
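A short, hypothetical CUDA sketch of such a loop may help: each CUDA Thread below iterates a
data-dependent number of times, and lanes whose trip count is exhausted are simply masked off
until every lane in the thread of SIMD instructions has left the loop. The kernel name and
the count array are assumptions for illustration:
__global__ void variable_trip_count(float *x, const int *count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    /* Lanes with a small count[i] go idle (mask bit zero) while neighboring
       lanes keep iterating; the loop branch falls through only once every
       lane's trip count is exhausted. */
    for (int k = 0; k < count[i]; k++)
        x[i] = 0.5f * x[i] + 1.0f;
}
The program is written as if each element had its own program counter; the mask hardware
supplies that illusion, at whatever cost the spread of trip counts within each thread of SIMD
instructions imposes.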
The resulting performance sometimes belies that simple abstraction. Writing programs that
operate SIMD Lanes in this highly independent MIMD mode is like writing programs that use
lots of virtual address space on a computer with a smaller physical memory. Both are correct,
but they may run so slowly that the programmer could be displeased with the result.