In reality, there are many more lanes in GPUs, so GPU "chimes" are shorter. While a vector processor might have 2 to 8 lanes and a vector length of, say, 32 (making a chime 4 to 16 clock cycles), a multithreaded SIMD Processor might have 8 or 16 lanes. A SIMD Thread is 32 elements wide, so a GPU chime would be just 2 or 4 clock cycles. This difference is why we use "SIMD Processor" as the more descriptive term: it is closer to a SIMD design than to a traditional vector processor design.
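To make the arithmetic explicit, a chime is simply the element count divided by the number of lanes that process it:

\[
\text{chime} = \frac{\text{vector length}}{\text{number of lanes}}, \qquad
\text{e.g., } \frac{32}{2} = 16 \text{ to } \frac{32}{8} = 4 \text{ cycles for the vector processor, versus } \frac{32}{8} = 4 \text{ to } \frac{32}{16} = 2 \text{ cycles for the GPU.}
\]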
The closest GPU term to a vectorized loop is Grid, and a PTX instruction is the closest to a
vector instruction since a SIMD Thread broadcasts a PTX instruction to all SIMD Lanes.
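As a concrete illustration, consider how a DAXPY-style loop maps onto these terms (a minimal sketch; the kernel name, problem size, and launch configuration are illustrative, not from the text):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each CUDA Thread handles one iteration of the vectorized loop. The
// hardware groups threads into 32-element SIMD Threads (warps), and a
// SIMD Thread broadcasts each PTX instruction to all 32 SIMD Lanes.
__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    double *x, *y;
    cudaMallocManaged(&x, n * sizeof(double));
    cudaMallocManaged(&y, n * sizeof(double));
    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

    // The Grid (the whole vectorized loop) consists of enough
    // 256-thread Thread Blocks to cover all n iterations.
    daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```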
With respect to memory access instructions in the two architectures, all GPU loads are gather instructions and all GPU stores are scatter instructions. If the data addresses of CUDA Threads refer to nearby addresses that fall in the same cache/memory block at the same time, the Address Coalescing Unit of the GPU will ensure high memory bandwidth. The explicit unit-stride load and store instructions of vector architectures, versus the implicit unit stride of GPU programming, are why writing efficient GPU code requires that programmers think in terms of SIMD operations, even though the CUDA programming model looks like MIMD. Because CUDA Threads can generate their own addresses, strided as well as gather-scatter addressing is found in both vector architectures and GPUs.
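To see why the distinction matters in practice, compare a unit-stride kernel, whose accesses the Address Coalescing Unit can merge into a few wide memory-block transfers, with a strided one, whose lanes each touch a different memory block (a sketch; kernel and parameter names are illustrative):

```cuda
#include <cuda_runtime.h>

// Coalesced: consecutive CUDA Threads of a SIMD Thread touch
// consecutive addresses, so its 32 loads fall into the same memory
// blocks and are serviced at high bandwidth.
__global__ void copy_unit_stride(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: neighboring threads touch addresses `stride` elements
// apart, so each lane's load hits a different memory block and the
// accesses of a SIMD Thread cannot be coalesced.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
// Both kernels are launched like the daxpy example above.
```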
As we mentioned several times, the two architectures take very different approaches to hiding memory latency. Vector architectures amortize it across all the elements of the vector by having a deeply pipelined access, so you pay the latency only once per vector load or store. Hence, vector loads and stores are like a block transfer between memory and the vector registers. In contrast, GPUs hide memory latency using multithreading. (Some researchers are investigating adding multithreading to vector architectures to try to capture the best of both worlds.)
With respect to conditional branch instructions, both architectures implement them using mask registers. Both conditional branch paths occupy time and/or space even when they do not store a result. The difference is that the vector compiler manages mask registers explicitly in software, while the GPU hardware and assembler manage them implicitly using branch synchronization markers and an internal stack to save, complement, and restore masks.
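A small kernel shows how this plays out (a sketch; names are illustrative). Within one 32-lane SIMD Thread, lanes that disagree on the condition are serialized under a hardware-managed mask:

```cuda
#include <cuda_runtime.h>

// When the lanes of a SIMD Thread disagree on the condition below, the
// hardware runs the "then" path with the dissenting lanes masked off,
// then complements the mask (saved on its internal stack) and runs the
// "else" path. Both paths cost time even though each lane stores only
// one result.
__global__ void rectify(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)
            x[i] = 0.0f;         // executed by lanes holding negative values
        else
            x[i] = 2.0f * x[i];  // executed under the complemented mask
    }
}
```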
As mentioned above, the conditional branch mechanism of GPUs gracefully handles the strip-mining problem of vector architectures. When the vector length is unknown at compile time, the program must calculate the modulo of the application vector length and the maximum vector length and store it in the vector length register. The strip-mined loop then resets the vector length register to the maximum vector length for the rest of the loop. This case is simpler with GPUs, since they just iterate the loop until all the SIMD Lanes reach the loop bound. On the last iteration, some SIMD Lanes will be masked off and then restored after the loop completes.
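The contrast is easy to see side by side (a sketch; MVL and the function name are hypothetical stand-ins). A vector compiler must emit explicit vector-length bookkeeping, while a GPU kernel needs only the bounds guard already shown in the daxpy example:

```cuda
const int MVL = 32;  // maximum vector length of a hypothetical vector machine

// Strip-mined loop as a vector compiler would structure it: the first
// strip processes n mod MVL elements by setting the vector length
// register, and every remaining strip runs at the full MVL.
void daxpy_strip_mined(int n, double a, const double *x, double *y) {
    int vl = n % MVL;            // value of the vector length register
    if (vl == 0) vl = MVL;       // first strip is full when MVL divides n
    for (int i = 0; i < n; ) {
        for (int j = 0; j < vl; ++j)  // stands in for one vector instruction
            y[i + j] = a * x[i + j] + y[i + j];
        i += vl;
        vl = MVL;                // reset to maximum for the rest of the loop
    }
}
// The GPU needs none of this bookkeeping: on the last iteration the
// guard `if (i < n)` simply masks off the excess SIMD Lanes.
```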
The control processor of a vector computer plays an important role in the execution of vector instructions. It broadcasts operations to all the vector lanes and broadcasts a scalar register value for vector-scalar operations. It also does implicit calculations that are explicit in GPUs, such as automatically incrementing memory addresses for unit-stride and non-unit-stride loads and stores. The control processor is missing in the GPU. The closest analogy is the Thread Block Scheduler, which assigns Thread Blocks (the bodies of a vectorized loop) to multithreaded SIMD Processors. The runtime hardware mechanisms in a GPU that both generate addresses and then discover whether they are adjacent, which is commonplace in many DLP applications, are likely less power-efficient than using a control processor.
The scalar processor in a vector computer executes the scalar instructions of a vector program; that is, it performs operations that would be too slow to do in the vector unit. Although