Note that a CUDA Thread is just a vertical cut of a thread of SIMD instructions, corresponding
to one element executed by one SIMD Lane. Beware that CUDA Threads are very different
from POSIX threads; you can't make arbitrary system calls from a CUDA Thread.
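The "vertical cut" idea can be sketched with a toy model. This is only an illustration, not how the hardware is built: a thread of SIMD instructions is modeled as a list of Python functions applied to every lane in lockstep, and the names run_simd_thread and cuda_thread_view are hypothetical helpers introduced here.

```python
# Toy model (an assumption for illustration): each SIMD instruction is a
# one-argument function applied to every lane in lockstep.

def run_simd_thread(instructions, lanes_data):
    """Run each instruction across all lanes; return one value history per lane."""
    history = [[x] for x in lanes_data]          # one column per SIMD lane
    for op in instructions:                      # one row per SIMD instruction
        for lane in range(len(history)):
            history[lane].append(op(history[lane][-1]))
    return history

def cuda_thread_view(history, lane):
    """The 'vertical cut': everything a single lane computed, in program order."""
    return history[lane]
```

Calling run_simd_thread with two instructions (double, then add one) on the lane data [0, 1, 2, 3] and then taking cuda_thread_view(history, 2) gives the sequence of values seen by the CUDA Thread that owns element 2.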
We're now ready to see what GPU instructions look like.
NVIDIA GPU Instruction Set Architecture
Unlike most system processors, the instruction set target of the NVIDIA compilers is an abstraction
of the hardware instruction set. PTX (Parallel Thread Execution) provides a stable instruction
set for compilers as well as compatibility across generations of GPUs. The hardware
instruction set is hidden from the programmer. PTX instructions describe the operations on
a single CUDA Thread, and usually map one-to-one with hardware instructions, but one PTX
instruction can expand to many machine instructions, and vice versa. PTX uses virtual registers,
so the compiler figures out how many physical vector registers a SIMD thread needs, and then an
optimizer divides the available register storage between the SIMD threads. This optimizer also
eliminates dead code, folds instructions together, and calculates places where branches might
diverge and places where diverged paths could converge.
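The register-division step comes down to simple arithmetic, sketched below. Every number here is an assumption chosen for illustration and not taken from the text: a register file of 32,768 registers, 32 lanes per SIMD thread, and a hardware cap of 48 resident SIMD threads. The function name resident_simd_threads is hypothetical.

```python
def resident_simd_threads(regfile_size, regs_per_cuda_thread,
                          lanes_per_simd_thread=32, hw_limit=48):
    """Estimate how many SIMD threads fit in the register file.

    All default values are illustrative assumptions, not figures for any
    particular GPU.
    """
    if regs_per_cuda_thread <= 0:
        raise ValueError("register demand must be positive")
    # Each SIMD thread needs one copy of the registers for every lane.
    regs_per_simd_thread = regs_per_cuda_thread * lanes_per_simd_thread
    # The register file and the hardware cap both bound residency.
    return min(hw_limit, regfile_size // regs_per_simd_thread)
```

Under these assumed numbers, a kernel that needs 64 registers per CUDA Thread leaves room for 16 SIMD threads, while one that needs 16 registers is bounded by the hardware cap instead of by register storage.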
While there is some similarity between the x86 microarchitectures and PTX, in that both
translate to an internal form (microinstructions for x86), the difference is that this translation
happens in hardware at runtime during execution on the x86 versus in software at load time
on a GPU.
The format of a PTX instruction is
opcode.type d, a, b, c;
where d is the destination operand; a, b, and c are source operands; and the operation type is
one of the following:
Type                                      .type Specifier
Untyped bits 8, 16, 32, and 64 bits       .b8, .b16, .b32, .b64
Unsigned integer 8, 16, 32, and 64 bits   .u8, .u16, .u32, .u64
Signed integer 8, 16, 32, and 64 bits     .s8, .s16, .s32, .s64
Floating point 16, 32, and 64 bits        .f16, .f32, .f64
Source operands are 32-bit or 64-bit registers or a constant value. Destinations are registers,
except for store instructions.
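To make the instruction format concrete, here is a minimal sketch of a parser for the simplified form opcode.type d, a, b, c; described above. It is not a real PTX assembler: actual PTX allows additional modifiers, which this sketch simply folds into the opcode field. The function name parse_ptx and the returned dictionary layout are assumptions made for illustration.

```python
# Type specifiers copied from the table of PTX operation types.
PTX_TYPES = {
    "b8", "b16", "b32", "b64",   # untyped bits
    "u8", "u16", "u32", "u64",   # unsigned integer
    "s8", "s16", "s32", "s64",   # signed integer
    "f16", "f32", "f64",         # floating point
}

def parse_ptx(line):
    """Split 'opcode.type d, a, b, c;' into fields (simplified form only)."""
    text = line.strip()
    if not text.endswith(";"):
        raise ValueError("expected a trailing ';'")
    # Everything before the first space is 'opcode.type'; the rest is operands.
    head, _, operand_text = text[:-1].strip().partition(" ")
    # The type specifier is whatever follows the last '.' in the head.
    opcode, dot, type_spec = head.rpartition(".")
    if not dot or not opcode or type_spec not in PTX_TYPES:
        raise ValueError(f"unknown or missing type specifier in {head!r}")
    operands = [op.strip() for op in operand_text.split(",") if op.strip()]
    if not operands:
        raise ValueError("expected at least a destination operand")
    # First operand is the destination d; the rest are sources a, b, c.
    return {"opcode": opcode, "type": type_spec,
            "dest": operands[0], "sources": operands[1:]}
```

For example, parse_ptx("add.f64 RD0, RD0, RD2;") yields the opcode add, the type f64, the destination RD0, and the sources RD0 and RD2.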
Figure 4.17 shows the basic PTX instruction set. All instructions can be predicated by 1-bit
predicate registers, which can be set by a set predicate instruction (setp). The control flow
instructions are function call and return, thread exit, branch, and barrier synchronization for
threads within a thread block (bar.sync). Placing a predicate in front of a branch instruction
gives us conditional branches. The compiler or PTX programmer declares virtual registers as
32-bit or 64-bit typed or untyped values. For example, R0, R1, ... are for 32-bit values and
RD0, RD1, ... are for 64-bit registers. Recall that the assignment of virtual registers to
physical registers occurs at load time with PTX.
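Predication can be sketched with a small lane-by-lane model. This is an assumption-laden illustration, not PTX semantics in full: registers are modeled as Python lists with one entry per SIMD Lane, and setp_lt and predicated are hypothetical helper names standing in for a set predicate instruction and a predicated instruction.

```python
def setp_lt(a, b):
    """Model a set-predicate compare: one 1-bit result per lane (a < b)."""
    return [x < y for x, y in zip(a, b)]

def predicated(pred, op, dest, *srcs):
    """Apply op lane by lane, writing only the lanes whose predicate is set.

    Lanes with a false predicate keep their old destination value.
    """
    return [op(*args) if p else old
            for p, old, *args in zip(pred, dest, *srcs)]
```

As a usage example, with R0 = [5, -3, 7, -1], the predicate P1 = setp_lt(R0, [0, 0, 0, 0]) marks the negative lanes, and predicated(P1, lambda x: -x, R0, R0) negates only those lanes, leaving the absolute value in every lane.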