The following sequence of PTX instructions is for one iteration of our DAXPY loop on page
289:
shl.u32 R8, blockIdx, 9 ; Thread Block ID * Block size (512 or 2^9)
add.u32 R8, R8, threadIdx ; R8 = i = my CUDA Thread ID
shl.u32 R8, R8, 3 ; byte offset
ld.global.f64 RD0, [X+R8] ; RD0 = X[i]
ld.global.f64 RD2, [Y+R8] ; RD2 = Y[i]
mul.f64 RD0, RD0, RD4 ; Product in RD0 = RD0 * RD4 (scalar a)
add.f64 RD0, RD0, RD2 ; Sum in RD0 = RD0 + RD2 (Y[i])
st.global.f64 [Y+R8], RD0 ; Y[i] = sum (X[i]*a + Y[i])
As demonstrated above, the CUDA programming model assigns one CUDA Thread to each
loop iteration and provides a unique identifier number to each thread block (blockIdx) and to
each CUDA Thread within a block (threadIdx). Thus, it creates 8192 CUDA Threads and uses
the unique number to address each element in the array, so there is no incrementing or branch-
ing code. The first three PTX instructions calculate the unique element byte offset in R8, which
is added to the base of the arrays. The following PTX instructions load two double-precision
floating-point operands, multiply and add them, and store the sum. (We'll describe the PTX
code corresponding to the CUDA code "if (i < n)" below.)
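For reference, here is a minimal sketch of a CUDA C kernel from which PTX like the above
could be generated, assuming this excerpt's block size of 512 and array size of 8192; the
guard "if (i < n)" is the test mentioned in the parenthetical:

__global__ void daxpy(int n, double a, double *x, double *y)
{
    // blockIdx.x * blockDim.x + threadIdx.x matches the shl/add pair above
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                   // guard discussed below
        y[i] = a * x[i] + y[i];  // the mul/add/store sequence
}

// Host-side launch: 8192 elements / 512 threads per block = 16 blocks
// daxpy<<<16, 512>>>(8192, a, x, y);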
Note that unlike vector architectures, GPUs don't have separate instructions for sequential
data transfers, strided data transfers, and gather-scatter data transfers. All data transfers are
gather-scatter! To regain the efficiency of sequential (unit-stride) data transfers, GPUs include
special Address Coalescing hardware to recognize when the SIMD Lanes within a thread of
SIMD instructions are collectively issuing sequential addresses. That runtime hardware then
notifies the Memory Interface Unit to request a block transfer of 32 sequential words. To get
this important performance improvement, the GPU programmer must ensure that adjacent
CUDA Threads access nearby addresses at the same time, so that they can be coalesced into
one or a few memory or cache blocks, which our example does.
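As an illustrative sketch (not from the text; the kernel names are invented), the contrast
below shows an access pattern the coalescing hardware can merge versus one it cannot:

__global__ void copy_coalesced(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // lane k touches word k
    if (i < n)
        out[i] = in[i];  // adjacent lanes, adjacent words: one block transfer
}

__global__ void copy_strided(const double *in, double *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];  // lanes touch scattered words: many separate transfers
}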
Conditional Branching in GPUs
Just as with unit-stride data transfers, there are strong similarities between how vector
architectures and GPUs handle IF statements, with the former implementing the mechanism
largely in software with limited hardware support and the latter making use of even more
hardware. As we shall see, in addition to explicit predicate registers, GPU branch hardware
uses internal masks, a branch synchronization stack, and instruction markers to manage when
a branch diverges into multiple execution paths and when the paths converge.
At the PTX assembler level, control flow of one CUDA thread is described by the PTX
instructions branch, call, return, and exit, plus individual per-thread-lane predication of each
instruction, specified by the programmer with per-thread-lane 1-bit predicate registers. The
PTX assembler analyzes the PTX branch graph and optimizes it to the fastest GPU hardware
instruction sequence.
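To make the predication point concrete, here is a hypothetical sketch (clamp_neg is an
invented name): a short guarded assignment like this is the kind of conditional the PTX
assembler can map to a predicated instruction rather than a branch:

__global__ void clamp_neg(double *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] < 0.0)  // setp writes a 1-bit predicate per thread lane
        x[i] = 0.0;           // short body: a candidate for predicated execution
}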
At the GPU hardware instruction level, control flow includes branch, jump, jump indexed,
call, call indexed, return, exit, and special instructions that manage the branch synchronization
stack. GPU hardware provides each SIMD thread with its own stack; a stack entry contains an
identifier token, a target instruction address, and a target thread-active mask. There are GPU
special instructions that push stack entries for a SIMD thread and special instructions and in-
struction markers that pop a stack entry or unwind the stack to a specified entry and branch to
the target instruction address with the target thread-active mask. GPU hardware instructions
also have an individual per-lane predicate (enable) bit, specified with a 1-bit predicate register
for each lane.
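A hedged sketch of how these pieces act together on a divergent if/else (the kernel and its
names are invented for illustration; the comments describe the mask-stack behavior outlined
above):

__global__ void divergent(double *X, const double *Y, const double *Z, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Lanes of one SIMD thread may disagree on this test. The hardware then
    // pushes the current thread-active mask on the branch synchronization
    // stack and runs the THEN path with only the "true" lanes enabled ...
    if (X[i] != 0.0)
        X[i] = X[i] - Y[i];  // executed under the true-lane mask
    else
        X[i] = Z[i];         // ... complements the mask for the ELSE path ...
    // ... and pops the stack here so all lanes reconverge.
}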