Here is the VMIPS code for DAXPY.
    L.D      F0,a       ;load scalar a
    LV       V1,Rx      ;load vector X
    MULVS.D  V2,V1,F0   ;vector-scalar multiply
    LV       V3,Ry      ;load vector Y
    ADDVV.D  V4,V2,V3   ;add
    SV       V4,Ry      ;store the result
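For readers less familiar with the VMIPS mnemonics, the sequence can be read as the following Python sketch, in which each vector instruction becomes one whole-list operation. The function name daxpy_vmips_style and the use of Python lists in place of vector registers are illustrative assumptions, not part of VMIPS.

```python
def daxpy_vmips_style(a, x, y):
    """Compute a*x + y element-wise, one Python step per VMIPS instruction."""
    f0 = a                                   # L.D     F0,a     load scalar a
    v1 = list(x)                             # LV      V1,Rx    load vector X
    v2 = [f0 * xi for xi in v1]              # MULVS.D V2,V1,F0 vector-scalar multiply
    v3 = list(y)                             # LV      V3,Ry    load vector Y
    v4 = [p + yi for p, yi in zip(v2, v3)]   # ADDVV.D V4,V2,V3 add
    return v4                                # SV      V4,Ry    (returned here, not stored in place)


result = daxpy_vmips_style(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
```

The sketch returns a new list rather than overwriting Y, which keeps it side-effect free; the VMIPS code stores the result back to the memory addressed by Ry.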
The most dramatic difference is that the vector processor greatly reduces the
dynamic instruction bandwidth, executing only 6 instructions versus almost
600 for MIPS. This reduction occurs because the vector operations work on 64
elements and the overhead instructions that constitute nearly half the loop on
MIPS are not present in the VMIPS code. When the compiler produces vector
instructions for such a sequence and the resulting code spends much of its time
running in vector mode, the code is said to be vectorized or vectorizable. Loops
can be vectorized when they have no dependences between iterations of the
loop, which are called loop-carried dependences (see Section 4.5).
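The instruction counts quoted above can be checked with a little arithmetic. The sketch below assumes that the scalar MIPS loop, which is not shown in this excerpt, has a 9-instruction body (of which 4 are loop overhead) plus 2 setup instructions; those numbers and all names in the code are assumptions made for illustration.

```python
VECTOR_LENGTH = 64  # elements per vector operation, as stated in the text


def mips_dynamic_instructions(n, loop_body=9, setup=2):
    """Dynamic instruction count for a scalar loop over n elements.

    loop_body and setup are assumed values for the MIPS DAXPY loop.
    """
    return setup + loop_body * n


def vmips_dynamic_instructions():
    """The VMIPS sequence is straight-line code: six instructions in total."""
    return 6


mips_count = mips_dynamic_instructions(VECTOR_LENGTH)
vmips_count = vmips_dynamic_instructions()
overhead_fraction = 4 / 9  # assumed: 4 of the 9 loop instructions are overhead

print(mips_count, vmips_count)        # 578 6
print(round(overhead_fraction, 2))    # 0.44
```

Under these assumptions the scalar version executes 578 instructions, which matches "almost 600", and the overhead share of 4/9 matches "nearly half the loop".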
Another important difference between MIPS and VMIPS is the frequency of
pipeline interlocks. In the straightforward MIPS code, every ADD.D must wait for
a MUL.D, and every S.D must wait for the ADD.D. On the vector processor, each
vector instruction will only stall for the first element in each vector, and then
subsequent elements will flow smoothly down the pipeline. Thus, pipeline stalls
are required only once per vector instruction, rather than once per vector
element. Vector architects call forwarding of element-dependent operations
chaining, in that the dependent operations are “chained” together. In this example,
the pipeline stall frequency on MIPS will be about 64× higher than it is on
VMIPS. Software pipelining or loop unrolling (Appendix H) can reduce the
pipeline stalls on MIPS; however, the large difference in instruction bandwidth
cannot be reduced substantially.
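The 64× figure follows from a simple count of stall events. The sketch below is a rough model under stated assumptions: it counts how often a dependent instruction has to wait, not how many cycles each wait costs, and the function name and parameters are hypothetical.

```python
def stall_events(dependent_instructions, elements, vectorized):
    """Count stall events for a chain of dependent instructions.

    Scalar code stalls once per dependent instruction for every element;
    vector code stalls once per dependent vector instruction, for the
    first element only.
    """
    if vectorized:
        return dependent_instructions
    return dependent_instructions * elements


# Two dependences from the text: ADD.D waits for MUL.D, S.D waits for ADD.D.
scalar_stalls = stall_events(2, 64, vectorized=False)
vector_stalls = stall_events(2, 64, vectorized=True)
print(scalar_stalls // vector_stalls)  # 64
```

The ratio equals the vector length regardless of how many dependent instructions are counted, which is why the text can say "about 64×" without listing every dependence.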
Vector Execution Time
The execution time of a sequence of vector operations primarily depends on three factors: (1)
the length of the operand vectors, (2) structural hazards among the operations, and (3) the data
dependences. Given the vector length and the initiation rate, which is the rate at which a vector
unit consumes new operands and produces new results, we can compute the time for a single
vector instruction. All modern vector computers have vector functional units with multiple
parallel pipelines (or lanes) that can produce two or more results per clock cycle, but they may
also have some functional units that are not fully pipelined. For simplicity, our VMIPS
implementation has one lane with an initiation rate of one element per clock cycle for individual
operations. Thus, the execution time in clock cycles for a single vector instruction is
approximately the vector length.
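As a minimal sketch of this estimate, the function below computes the approximate cycle count for one vector instruction. It ignores start-up latency, and the generalization to several lanes (each lane producing one result per clock cycle) is an assumption that goes beyond the single-lane VMIPS implementation described here; the function name is hypothetical.

```python
import math


def vector_instruction_cycles(vector_length, lanes=1):
    """Approximate clock cycles for one vector instruction.

    Assumes a fully pipelined unit with an initiation rate of one element
    per lane per clock cycle, and ignores pipeline start-up latency.
    """
    if vector_length < 0 or lanes < 1:
        raise ValueError("vector_length must be >= 0 and lanes >= 1")
    return math.ceil(vector_length / lanes)


print(vector_instruction_cycles(64))           # 64
print(vector_instruction_cycles(64, lanes=4))  # 16
```

With one lane the result is the vector length itself, as the text states; the multi-lane case is included only to show how the same estimate would scale.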
To simplify the discussion of vector execution and vector performance, we use the notion of
a convoy, which is the set of vector instructions that could potentially execute together. As we
shall soon see, you can estimate performance of a section of code by counting the number of
convoys. The instructions in a convoy must not contain any structural hazards; if such hazards