for multithreading, where each core has 16 lanes. The biggest difference is multithreading,
which is fundamental to GPUs and missing from most vector processors.
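To make the multithreading point concrete, here is a minimal CUDA sketch (not from the text; the kernel name daxpy and the launch parameters are illustrative): far more SIMD threads are launched than there are SIMD Lanes, so the SIMD Thread Scheduler can switch threads to hide memory latency rather than relying on deep vector registers.

#include <cuda_runtime.h>

// Each CUDA thread handles one element; a warp of 32 threads forms one SIMD thread.
__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // while one warp waits on memory, another can issue
}

// Launch with many blocks resident per SIMD Processor so the scheduler
// always has runnable warps:
//   daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_x, d_y);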
Looking at the registers in the two architectures, the VMIPS register file holds entire vectors—that is, a contiguous block of 64 doubles. In contrast, a single vector in a GPU would be distributed across the registers of all SIMD Lanes. A VMIPS processor has 8 vector registers with 64 elements each, or 512 elements total. A GPU thread of SIMD instructions has up to 64 registers with 32 elements each, or 2048 elements. These extra GPU registers support multithreading.
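As a rough illustration of that distribution (an assumed mapping, not something the text specifies), the CUDA sketch below keeps a 64-element vector in the registers of one 32-lane warp, two elements per lane:

// Each lane holds its two slices of the 64-element vector in its own registers,
// so the whole vector is spread across the register files of all 32 SIMD Lanes.
__global__ void vector_add64(const double *a, const double *b, double *c)
{
    int lane = threadIdx.x;                 // 0..31: one SIMD Lane
    double a0 = a[lane], a1 = a[lane + 32];
    double b0 = b[lane], b1 = b[lane + 32];
    c[lane]      = a0 + b0;
    c[lane + 32] = a1 + b1;
}
// Launched as a single warp: vector_add64<<<1, 32>>>(d_a, d_b, d_c);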
Figure 4.22 is a block diagram of the execution units of a vector processor on the left and
a multithreaded SIMD Processor of a GPU on the right. For pedagogic purposes, we assume
the vector processor has four lanes and the multithreaded SIMD Processor also has four SIMD
Lanes. This figure shows that the four SIMD Lanes act in concert much like a four-lane vector
unit, and that a SIMD Processor acts much like a vector processor.
FIGURE 4.22 A vector processor with four lanes on the left and a multithreaded SIMD Processor of a GPU with four SIMD Lanes on the right. (GPUs typically have 8 to 16 SIMD Lanes.) The control processor supplies scalar operands for scalar-vector operations, increments addressing for unit and non-unit stride accesses to memory, and performs other accounting-type operations. Peak memory performance only occurs in a GPU when the Address Coalescing unit can discover localized addressing. Similarly, peak computational performance occurs when all internal mask bits are set identically. Note that the SIMD Processor has one PC per SIMD thread to help with multithreading.
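The two caption conditions can be illustrated in code. In the sketch below (hypothetical kernels, not from the figure), the first kernel uses unit-stride addresses that the Address Coalescing unit can merge into a few wide memory transactions, while the second contains a branch that diverges within a warp, so the mask bits are not set identically and some lanes idle on each path:

// Unit stride: adjacent lanes touch adjacent doubles, so lane requests coalesce.
__global__ void copy_unit_stride(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Divergence: lanes in the same warp take different paths, so on each path some
// mask bits are off and peak computational performance is lost.
__global__ void scale_or_bump(double *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) x[i] *= 2.0;   // even lanes active, odd lanes masked
        else            x[i] += 1.0;   // odd lanes active, even lanes masked
    }
}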
 