Data-Level Parallelism in Vector, SIMD, and GPU Architectures - Computer Architecture: A Quantitative Approach


Here is the VMIPS code for DAXPY.

    L.D      F0,a        ;load scalar a
    LV       V1,Rx       ;load vector X
    MULVS.D  V2,V1,F0    ;vector-scalar multiply
    LV       V3,Ry       ;load vector Y
    ADDVV.D  V4,V2,V3    ;add
    SV       V4,Ry       ;store the result

The most dramatic difference is that the vector processor greatly reduces the dynamic instruction bandwidth, executing only 6 instructions versus almost 600 for MIPS. This reduction occurs because the vector operations work on 64 elements and the overhead instructions that constitute nearly half the loop on MIPS are not present in the VMIPS code. When the compiler produces vector instructions for such a sequence and the resulting code spends much of its time running in vector mode, the code is said to be vectorized or vectorizable. Loops can be vectorized when they have no dependences between iterations, which are called loop-carried dependences (see Section 4.5).
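The instruction-count comparison can be checked with a short Python sketch. The vector-style function mirrors the six VMIPS instructions above, one step per instruction. The scalar loop counts (2 setup instructions, 9 instructions per iteration, 4 of them loop overhead) are assumptions about the scalar MIPS version of DAXPY, which is not listed in this excerpt; the function and constant names are illustrative only.

```python
VECTOR_LENGTH = 64  # elements per vector, from the text


def daxpy_scalar(a, x, y):
    """Scalar view: one multiply-add per loop iteration."""
    result = []
    for i in range(len(x)):
        result.append(a * x[i] + y[i])
    return result


def daxpy_vector_style(a, x, y):
    """Vector view: each step stands in for one VMIPS instruction."""
    f0 = a                                 # L.D     F0,a
    v1 = list(x)                           # LV      V1,Rx
    v2 = [f0 * e for e in v1]              # MULVS.D V2,V1,F0
    v3 = list(y)                           # LV      V3,Ry
    v4 = [p + q for p, q in zip(v2, v3)]   # ADDVV.D V4,V2,V3
    return v4                              # SV      V4,Ry


# Assumed shape of the scalar MIPS loop (not shown in this excerpt).
SCALAR_SETUP = 2          # instructions executed once before the loop
SCALAR_LOOP_BODY = 9      # instructions executed per iteration
SCALAR_LOOP_OVERHEAD = 4  # address updates, compare, and branch

scalar_count = SCALAR_SETUP + SCALAR_LOOP_BODY * VECTOR_LENGTH
vector_count = 6          # the six VMIPS instructions listed above

x = [float(i) for i in range(VECTOR_LENGTH)]
y = [1.0] * VECTOR_LENGTH
assert daxpy_scalar(2.0, x, y) == daxpy_vector_style(2.0, x, y)
print(scalar_count, vector_count)
```

Under these assumptions the scalar count comes to 578, which matches "almost 600", and 4 of the 9 loop instructions are overhead, which matches "nearly half the loop".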

Another important difference between MIPS and VMIPS is the frequency of pipeline interlocks. In the straightforward MIPS code, every ADD.D must wait for a MUL.D, and every S.D must wait for the ADD.D. On the vector processor, each vector instruction will only stall for the first element in each vector, and then subsequent elements will flow smoothly down the pipeline. Thus, pipeline stalls are required only once per vector instruction, rather than once per vector element. Vector architects call forwarding of element-dependent operations chaining, in that the dependent operations are “chained” together. In this example, the pipeline stall frequency on MIPS will be about 64× higher than it is on VMIPS. Software pipelining or loop unrolling (Appendix H) can reduce the pipeline stalls on MIPS; however, the large difference in instruction bandwidth cannot be reduced substantially.
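The 64× figure follows from a simple count of stall events, sketched below in Python. This is a counting model only: it tallies how often a dependent instruction has to wait, not how many cycles each wait lasts, and it considers just the two dependences named in the text (ADD.D on MUL.D, S.D on ADD.D). The function names are illustrative.

```python
VECTOR_LENGTH = 64   # elements per vector, from the text
DEPENDENT_PAIRS = 2  # ADD.D waits on MUL.D; S.D waits on ADD.D


def scalar_stall_events(n, dependent_pairs=DEPENDENT_PAIRS):
    """Scalar loop: every dependent pair stalls again on each iteration."""
    return n * dependent_pairs


def vector_stall_events(dependent_pairs=DEPENDENT_PAIRS):
    """Vector code: each dependent vector instruction stalls once, for its
    first element; the remaining elements stream behind it."""
    return dependent_pairs


ratio = scalar_stall_events(VECTOR_LENGTH) / vector_stall_events()
print(scalar_stall_events(VECTOR_LENGTH), vector_stall_events(), ratio)
```

The ratio equals the vector length because the per-element wait in the scalar loop collapses to a single wait per vector instruction.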

Vector Execution Time

The execution time of a sequence of vector operations primarily depends on three factors: (1) the length of the operand vectors, (2) structural hazards among the operations, and (3) the data dependences. Given the vector length and the initiation rate, which is the rate at which a vector unit consumes new operands and produces new results, we can compute the time for a single vector instruction. All modern vector computers have vector functional units with multiple parallel pipelines (or lanes) that can produce two or more results per clock cycle, but they may also have some functional units that are not fully pipelined. For simplicity, our VMIPS implementation has one lane with an initiation rate of one element per clock cycle for individual operations. Thus, the execution time in clock cycles for a single vector instruction is approximately the vector length.
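As a rough illustration of this estimate, the following Python sketch divides the vector length by the number of results produced per clock cycle. It ignores start-up latency, which is why the result is only approximate, and the `lanes` and `elements_per_lane_per_cycle` parameters are hypothetical names introduced here for illustration.

```python
import math


def vector_instruction_cycles(vector_length, lanes=1,
                              elements_per_lane_per_cycle=1):
    """Approximate cycles for one vector instruction, ignoring start-up
    latency: elements to process divided by results per clock cycle."""
    results_per_cycle = lanes * elements_per_lane_per_cycle
    return math.ceil(vector_length / results_per_cycle)


# One lane at one element per cycle, as in the VMIPS implementation above.
print(vector_instruction_cycles(64))           # 64

# A hypothetical four-lane unit for comparison.
print(vector_instruction_cycles(64, lanes=4))  # 16
```

With one lane and an initiation rate of one element per clock cycle, the estimate reduces to the vector length itself.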

To simplify the discussion of vector execution and vector performance, we use the notion of a convoy, which is the set of vector instructions that could potentially execute together. As we shall soon see, you can estimate performance of a section of code by counting the number of convoys. The instructions in a convoy must not contain any structural hazards; if such hazards
