The following sequence of PTX instructions is for one iteration of our DAXPY loop on page
289:
shl.u32 R8, blockIdx, 9 ; Thread Block ID * Block size (512 or 2^9)
add.u32 R8, R8, threadIdx ; R8 = i = my CUDA Thread ID
shl.u32 R8, R8, 3 ; byte offset
ld.global.f64 RD0, [X+R8] ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4 ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2 ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0 ; Y[i] = sum (X[i]*a + Y[i])
As demonstrated above, the CUDA programming model assigns one CUDA Thread to each
loop iteration and provides a unique identifier number to each thread block (blockIdx) and to
each CUDA Thread within a block (threadIdx). Thus, it creates 8192 CUDA Threads and uses
the unique number to address each element in the array, so there is no incrementing or branch-
ing code. The first three PTX instructions calculate the unique element byte offset in R8, which
is added to the base of the arrays. The following PTX instructions load two double-precision
floating-point operands, multiply and add them, and store the sum. (We'll describe the PTX
code corresponding to the CUDA code "if (i < n)" below.)
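For reference, here is a minimal sketch of a CUDA C kernel from which PTX like the above
could be generated, assuming this excerpt's block size of 512 and array size of 8192; the
guard "if (i < n)" is the test mentioned in the parenthetical:

__global__ void daxpy(int n, double a, double *x, double *y)
{
    // blockIdx.x * blockDim.x + threadIdx.x matches the shl/add pair above
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                   // guard discussed below
        y[i] = a * x[i] + y[i];  // the mul/add/store sequence
}

// Host-side launch: 8192 elements / 512 threads per block = 16 blocks
// daxpy<<<16, 512>>>(8192, a, x, y);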
Note that unlike vector architectures, GPUs don't have separate instructions for sequential
data transfers, strided data transfers, and gather-scatter data transfers. All data transfers are
gather-scatter! To regain the efficiency of sequential (unit-stride) data transfers, GPUs include
special Address Coalescing hardware to recognize when the SIMD Lanes within a thread of
SIMD instructions are collectively issuing sequential addresses. That runtime hardware then
notifies the Memory Interface Unit to request a block transfer of 32 sequential words. To get
this important performance improvement, the GPU programmer must ensure that adjacent
CUDA Threads access nearby addresses at the same time, so that they can be coalesced into
one or a few memory or cache blocks, which our example does.
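As an illustrative sketch (not from the text; the kernel names are invented), the contrast
below shows an access pattern the coalescing hardware can merge versus one it cannot:

__global__ void copy_coalesced(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // lane k touches word k
    if (i < n)
        out[i] = in[i];  // adjacent lanes, adjacent words: one block transfer
}

__global__ void copy_strided(const double *in, double *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];  // lanes touch scattered words: many separate transfers
}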
Conditional Branching in GPUs
Just as with unit-stride data transfers, there are strong similarities between how vector
architectures and GPUs handle IF statements, with the former implementing the mechanism
largely in software with limited hardware support and the latter making use of even more
hardware. As we shall see, in addition to explicit predicate registers, GPU branch hardware
uses internal masks, a branch synchronization stack, and instruction markers to manage when
a branch diverges into multiple execution paths and when the paths converge.
At the PTX assembler level, control flow of one CUDA thread is described by the PTX
instructions branch, call, return, and exit, plus individual per-thread-lane predication of each
instruction, specified by the programmer with per-thread-lane 1-bit predicate registers. The
PTX assembler analyzes the PTX branch graph and optimizes it to the fastest GPU hardware
instruction sequence.
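To make the predication point concrete, here is a hypothetical sketch (clamp_neg is an
invented name): a short guarded assignment like this is the kind of conditional the PTX
assembler can map to a predicated instruction rather than a branch:

__global__ void clamp_neg(double *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] < 0.0)  // setp writes a 1-bit predicate per thread lane
        x[i] = 0.0;           // short body: a candidate for predicated execution
}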
At the GPU hardware instruction level, control flow includes branch, jump, jump indexed,
call, call indexed, return, exit, and special instructions that manage the branch synchronization
stack. GPU hardware provides each SIMD thread with its own stack; a stack entry contains an
identifier token, a target instruction address, and a target thread-active mask. There are GPU
special instructions that push stack entries for a SIMD thread and special instructions and in-
struction markers that pop a stack entry or unwind the stack to a specified entry and branch to
the target instruction address with the target thread-active mask. GPU hardware instructions
also have an individual per-lane predicate (enable) bit, specified with a 1-bit predicate register
for each lane.
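A hedged sketch of how these pieces act together on a divergent if/else (the kernel and its
names are invented for illustration; the comments describe the mask-stack behavior outlined
above):

__global__ void divergent(double *X, const double *Y, const double *Z, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Lanes of one SIMD thread may disagree on this test. The hardware then
    // pushes the current thread-active mask on the branch synchronization
    // stack and runs the THEN path with only the "true" lanes enabled ...
    if (X[i] != 0.0)
        X[i] = X[i] - Y[i];  // executed under the true-lane mask
    else
        X[i] = Z[i];         // ... complements the mask for the ELSE path ...
    // ... and pops the stack here so all lanes reconverge.
}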