Hardware Reference
FIGURE 4.5 Structure of a vector unit containing four lanes. The vector register storage
is divided across the lanes, with each lane holding every fourth element of each vector
register. The figure shows three vector functional units: an FP add, an FP multiply, and a
load-store unit. Each of the vector arithmetic units contains four execution pipelines, one per lane,
which act in concert to complete a single vector instruction. Note how each section of the
vector register file only needs to provide enough ports for pipelines local to its lane. This figure
does not show the path to provide the scalar operand for vector-scalar instructions, but the
scalar processor (or control processor) broadcasts a scalar value to all lanes.
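The register layout described in the caption can be sketched in a few lines of Python. This is only an illustrative model: it assumes a simple round-robin interleaving in which element i is stored in lane i mod 4, which matches "every fourth element" with element 0 in the first lane. The function name `lane_layout` and the 16-element register length are hypothetical choices for the example.

```python
def lane_layout(vector_length, num_lanes=4):
    """Return, for each lane, the element indices of one vector register it stores.

    Assumes round-robin interleaving: element i lives in lane i % num_lanes.
    """
    lanes = [[] for _ in range(num_lanes)]
    for element in range(vector_length):
        lanes[element % num_lanes].append(element)
    return lanes


# Hypothetical 16-element vector register spread over four lanes.
layout = lane_layout(vector_length=16, num_lanes=4)
for lane, elements in enumerate(layout):
    print(f"lane {lane}: elements {elements}")
```

With four lanes, lane 0 holds elements 0, 4, 8, and 12, lane 1 holds elements 1, 5, 9, and 13, and so on, so each lane's slice of the register file only has to serve the pipelines in that lane.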
Each lane contains one portion of the vector register file and one execution pipeline from
each vector functional unit. Each vector functional unit executes vector instructions at the rate
of one element group per cycle using multiple pipelines, one per lane. The first lane holds
the first element (element 0) for all vector registers, and so the first element in any vector
instruction will have its source and destination operands located in the first lane. This allocation
allows the arithmetic pipeline local to the lane to complete the operation without
communicating with other lanes. Accessing main memory also requires only intralane wiring. Avoiding
interlane communication reduces the wiring cost and register file ports required to build a
highly parallel execution unit, and helps explain why vector computers can complete up to 64
operations per clock cycle (2 arithmetic units and 2 load/store units across 16 lanes).
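As a rough illustration of the paragraph above, the following Python sketch models one functional unit processing one element group per cycle, and reproduces the peak figure of 64 operations per cycle. It is a simplified model, not a description of any real machine: it ignores pipeline start-up latency, and the names `lane_vector_add` and `peak_ops_per_cycle` are made up for the example.

```python
def lane_vector_add(a, b, num_lanes=4):
    """Add two vectors one element group per cycle; return (result, cycles).

    Simplified model: each outer-loop iteration stands for one clock cycle
    of a single functional unit, and pipeline start-up latency is ignored.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    result = [None] * len(a)
    cycles = 0
    for group_start in range(0, len(a), num_lanes):
        # Within a cycle, each lane handles its own element independently,
        # using only operands stored in that lane.
        for lane in range(num_lanes):
            i = group_start + lane
            if i < len(a):
                result[i] = a[i] + b[i]
        cycles += 1
    return result, cycles


def peak_ops_per_cycle(num_functional_units, num_lanes):
    """Peak element operations per cycle: one per unit per lane."""
    return num_functional_units * num_lanes


a = list(range(8))
b = [10] * 8
total, cycles = lane_vector_add(a, b, num_lanes=4)
print(total, cycles)

# 2 arithmetic units + 2 load/store units, 16 lanes (figures from the text).
print(peak_ops_per_cycle(num_functional_units=4, num_lanes=16))
```

In this model an 8-element add on four lanes takes 2 cycles, and four functional units across 16 lanes give the 64 operations per clock cycle quoted in the text.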
Adding multiple lanes is a popular technique to improve vector performance as it requires
little increase in control complexity and does not require changes to existing machine code. It
also allows designers to trade off die area, clock rate, voltage, and energy without sacrificing
peak performance. If the clock rate of a vector processor is halved, doubling the number of
lanes will retain the same potential performance.
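The clock-rate trade-off can be checked with a small calculation. The sketch below assumes peak throughput is simply lanes times functional units times clock rate; the 1 GHz and 500 MHz clock rates and the lane counts are hypothetical values chosen for the example, not figures from the text.

```python
def peak_ops_per_second(num_lanes, clock_hz, num_functional_units=1):
    """Peak element operations per second under the simple lanes * units * clock model."""
    return num_lanes * num_functional_units * clock_hz


# Hypothetical baseline: 4 lanes at 1 GHz.
baseline = peak_ops_per_second(num_lanes=4, clock_hz=1_000_000_000)

# Halve the clock rate and double the number of lanes.
scaled = peak_ops_per_second(num_lanes=8, clock_hz=500_000_000)

print(baseline == scaled)
```

Both configurations have the same peak rate under this model, which is why a designer can move to a slower clock and more lanes without giving up potential performance.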
 