for multithreading, where each core has 16 lanes. The biggest difference is multithreading,
which is fundamental to GPUs and missing from most vector processors.
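To make the multithreading point concrete, here is a minimal CUDA sketch (not from the text; the kernel name daxpy and the launch parameters are illustrative): far more SIMD threads are launched than there are SIMD Lanes, so the SIMD Thread Scheduler can switch threads to hide memory latency rather than relying on deep vector registers.

#include <cuda_runtime.h>

// Each CUDA thread handles one element; a warp of 32 threads forms one SIMD thread.
__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // while one warp waits on memory, another can issue
}

// Launch with many blocks resident per SIMD Processor so the scheduler
// always has runnable warps:
//   daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_x, d_y);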
Looking at the registers in the two architectures, the VMIPS register file holds entire vectors—that is, a contiguous block of 64 doubles. In contrast, a single vector in a GPU would be distributed across the registers of all SIMD Lanes. A VMIPS processor has 8 vector registers with 64 elements each, or 512 elements total. A GPU thread of SIMD instructions has up to 64 registers with 32 elements each, or 2048 elements. These extra GPU registers support multithreading.
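As a rough illustration of that distribution (an assumed mapping, not something the text specifies), the CUDA sketch below keeps a 64-element vector in the registers of one 32-lane warp, two elements per lane:

// Each lane holds its two slices of the 64-element vector in its own registers,
// so the whole vector is spread across the register files of all 32 SIMD Lanes.
__global__ void vector_add64(const double *a, const double *b, double *c)
{
    int lane = threadIdx.x;                 // 0..31: one SIMD Lane
    double a0 = a[lane], a1 = a[lane + 32];
    double b0 = b[lane], b1 = b[lane + 32];
    c[lane]      = a0 + b0;
    c[lane + 32] = a1 + b1;
}
// Launched as a single warp: vector_add64<<<1, 32>>>(d_a, d_b, d_c);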
Figure 4.22 is a block diagram of the execution units of a vector processor on the left and
a multithreaded SIMD Processor of a GPU on the right. For pedagogic purposes, we assume
the vector processor has four lanes and the multithreaded SIMD Processor also has four SIMD
Lanes. This figure shows that the four SIMD Lanes act in concert much like a four-lane vector
unit, and that a SIMD Processor acts much like a vector processor.
FIGURE 4.22 A vector processor with four lanes on the left and a multithreaded SIMD Processor of a GPU with four SIMD Lanes on the right. (GPUs typically have 8 to 16 SIMD Lanes.) The control processor supplies scalar operands for scalar-vector operations, increments addressing for unit and non-unit stride accesses to memory, and performs other accounting-type operations. Peak memory performance only occurs in a GPU when the Address Coalescing unit can discover localized addressing. Similarly, peak computational performance occurs when all internal mask bits are set identically. Note that the SIMD Processor has one PC per SIMD thread to help with multithreading.
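The two caption conditions can be illustrated in code. In the sketch below (hypothetical kernels, not from the figure), the first kernel uses unit-stride addresses that the Address Coalescing unit can merge into a few wide memory transactions, while the second contains a branch that diverges within a warp, so the mask bits are not set identically and some lanes idle on each path:

// Unit stride: adjacent lanes touch adjacent doubles, so lane requests coalesce.
__global__ void copy_unit_stride(const double *in, double *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Divergence: lanes in the same warp take different paths, so on each path some
// mask bits are off and peak computational performance is lost.
__global__ void scale_or_bump(double *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (i % 2 == 0) x[i] *= 2.0;   // even lanes active, odd lanes masked
        else            x[i] += 1.0;   // odd lanes active, even lanes masked
    }
}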
 