In reality, there are many more lanes in GPUs, so GPU "chimes" are shorter. While a vector processor might have 2 to 8 lanes and a vector length of, say, 32 (making a chime 4 to 16 clock cycles), a multithreaded SIMD Processor might have 8 or 16 lanes. A SIMD Thread is 32 elements wide, so a GPU chime would be just 2 or 4 clock cycles. This difference is why we use "SIMD Processor" as the more descriptive term: it is closer to a SIMD design than to a traditional vector processor design.
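To make the arithmetic explicit, a chime is simply the element count divided by the number of lanes that process it:

\[
\text{chime} = \frac{\text{vector length}}{\text{number of lanes}}, \qquad
\text{e.g., } \frac{32}{2} = 16 \text{ to } \frac{32}{8} = 4 \text{ cycles for the vector processor, versus } \frac{32}{8} = 4 \text{ to } \frac{32}{16} = 2 \text{ cycles for the GPU.}
\]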
The closest GPU term to a vectorized loop is Grid, and a PTX instruction is the closest to a
vector instruction since a SIMD Thread broadcasts a PTX instruction to all SIMD Lanes.
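As a concrete illustration, consider how a DAXPY-style loop maps onto these terms (a minimal sketch; the kernel name, problem size, and launch configuration are illustrative, not from the text):

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Each CUDA Thread handles one iteration of the vectorized loop. The
// hardware groups threads into 32-element SIMD Threads (warps), and a
// SIMD Thread broadcasts each PTX instruction to all 32 SIMD Lanes.
__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    double *x, *y;
    cudaMallocManaged(&x, n * sizeof(double));
    cudaMallocManaged(&y, n * sizeof(double));
    for (int i = 0; i < n; ++i) { x[i] = 1.0; y[i] = 2.0; }

    // The Grid (the whole vectorized loop) consists of enough
    // 256-thread Thread Blocks to cover all n iterations.
    daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, x, y);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 4.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```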
With respect to memory access instructions in the two architectures, all GPU loads are gather instructions and all GPU stores are scatter instructions. If the data addresses of CUDA Threads refer to nearby addresses that fall in the same cache/memory block at the same time, the Address Coalescing Unit of the GPU will ensure high memory bandwidth. The explicit unit-stride load and store instructions of vector architectures, versus the implicit unit stride of GPU programming, are why writing efficient GPU code requires that programmers think in terms of SIMD operations, even though the CUDA programming model looks like MIMD. Because CUDA Threads can generate their own addresses, strided as well as gather-scatter addressing is found in both vector architectures and GPUs.
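To see why the distinction matters in practice, compare a unit-stride kernel, whose accesses the Address Coalescing Unit can merge into a few wide memory-block transfers, with a strided one, whose lanes each touch a different memory block (a sketch; kernel and parameter names are illustrative):

```cuda
#include <cuda_runtime.h>

// Coalesced: consecutive CUDA Threads of a SIMD Thread touch
// consecutive addresses, so its 32 loads fall into the same memory
// blocks and are serviced at high bandwidth.
__global__ void copy_unit_stride(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: neighboring threads touch addresses `stride` elements
// apart, so each lane's load hits a different memory block and the
// accesses of a SIMD Thread cannot be coalesced.
__global__ void copy_strided(const float *in, float *out, int n, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
// Both kernels are launched like the daxpy example above.
```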
As we mentioned several times, the two architectures take very different approaches to hiding memory latency. Vector architectures amortize it across all the elements of the vector by having a deeply pipelined access, so you pay the latency only once per vector load or store. Hence, vector loads and stores are like a block transfer between memory and the vector registers. In contrast, GPUs hide memory latency using multithreading. (Some researchers are investigating adding multithreading to vector architectures to try to capture the best of both worlds.)
With respect to conditional branch instructions, both architectures implement them using mask registers. Both conditional branch paths occupy time and/or space even when they do not store a result. The difference is that the vector compiler manages mask registers explicitly in software, while the GPU hardware and assembler manage them implicitly using branch synchronization markers and an internal stack to save, complement, and restore masks.
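A small kernel shows how this plays out (a sketch; names are illustrative). Within one 32-lane SIMD Thread, lanes that disagree on the condition are serialized under a hardware-managed mask:

```cuda
#include <cuda_runtime.h>

// When the lanes of a SIMD Thread disagree on the condition below, the
// hardware runs the "then" path with the dissenting lanes masked off,
// then complements the mask (saved on its internal stack) and runs the
// "else" path. Both paths cost time even though each lane stores only
// one result.
__global__ void rectify(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)
            x[i] = 0.0f;         // executed by lanes holding negative values
        else
            x[i] = 2.0f * x[i];  // executed under the complemented mask
    }
}
```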
As mentioned above, the conditional branch mechanism of GPUs gracefully handles the strip-mining problem of vector architectures. When the vector length is unknown at compile time, the program must calculate the modulo of the application vector length and the maximum vector length and store it in the vector length register. The strip-mined loop then resets the vector length register to the maximum vector length for the rest of the loop. This case is simpler with GPUs, since they just iterate the loop until all the SIMD Lanes reach the loop bound. On the last iteration, some SIMD Lanes will be masked off and then restored after the loop completes.
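The contrast is easy to see side by side (a sketch; MVL and the function name are hypothetical stand-ins). A vector compiler must emit explicit vector-length bookkeeping, while a GPU kernel needs only the bounds guard already shown in the daxpy example:

```cuda
const int MVL = 32;  // maximum vector length of a hypothetical vector machine

// Strip-mined loop as a vector compiler would structure it: the first
// strip processes n mod MVL elements by setting the vector length
// register, and every remaining strip runs at the full MVL.
void daxpy_strip_mined(int n, double a, const double *x, double *y) {
    int vl = n % MVL;            // value of the vector length register
    if (vl == 0) vl = MVL;       // first strip is full when MVL divides n
    for (int i = 0; i < n; ) {
        for (int j = 0; j < vl; ++j)  // stands in for one vector instruction
            y[i + j] = a * x[i + j] + y[i + j];
        i += vl;
        vl = MVL;                // reset to maximum for the rest of the loop
    }
}
// The GPU needs none of this bookkeeping: on the last iteration the
// guard `if (i < n)` simply masks off the excess SIMD Lanes.
```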
The control processor of a vector computer plays an important role in the execution of vector instructions. It broadcasts operations to all the vector lanes and broadcasts a scalar register value for vector-scalar operations. It also does implicit calculations that are explicit in GPUs, such as automatically incrementing memory addresses for unit-stride and non-unit-stride loads and stores. The control processor is missing in the GPU. The closest analogy is the Thread Block Scheduler, which assigns Thread Blocks (the bodies of a vectorized loop) to multithreaded SIMD Processors. The runtime hardware mechanisms in a GPU that both generate addresses and then discover whether they are adjacent, which is commonplace in many DLP applications, are likely less power-efficient than using a control processor.
The scalar processor in a vector computer executes the scalar instructions of a vector program; that is, it performs operations that would be too slow to do in the vector unit. Although