were present, the instructions would need to be serialized and initiated in different convoys.
To keep the analysis simple, we assume that a convoy of instructions must complete execution
before any other instructions (scalar or vector) can begin execution.
It might seem that in addition to vector instruction sequences with structural hazards, se-
quences with read-after-write dependency hazards should also be in separate convoys, but
chaining allows them to be in the same convoy.
Chaining allows a vector operation to start as soon as the individual elements of its vector
source operand become available: The results from the first functional unit in the chain are
“forwarded” to the second functional unit. In practice, we often implement chaining by allowing the processor to read and write a particular vector register at the same time, albeit to different elements. Early implementations of chaining worked just like forwarding in scalar pipelining, but this restricted the timing of the source and destination instructions in the chain. Recent implementations use flexible chaining, which allows a vector instruction to chain to essentially
any other active vector instruction, assuming that we don't generate a structural hazard. All
modern vector architectures support flexible chaining, which we assume in this chapter.
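The convoy rules above (structural hazards force a new convoy; RAW dependences do not, because of flexible chaining) can be expressed mechanically. The following is a minimal sketch of that partitioning logic, not a model of any real machine; the instruction and functional-unit labels are my own illustration.

```python
# Sketch: partition vector instructions into convoys.
# Rule (with flexible chaining): an instruction starts a new convoy only
# when it needs a functional unit already in use in the current convoy;
# RAW dependences alone are hidden by chaining.

def partition_into_convoys(instructions):
    """instructions: list of (name, functional_unit) pairs, in program order."""
    convoys = []
    current, busy_units = [], set()
    for name, unit in instructions:
        if unit in busy_units:            # structural hazard: close this convoy
            convoys.append(current)
            current, busy_units = [], set()
        current.append(name)
        busy_units.add(unit)
    if current:
        convoys.append(current)
    return convoys

# A load chained into an add stays in one convoy; the store must wait
# for the load/store unit, so it starts a second convoy.
seq = [("LV V1,Rx", "load/store"),
       ("ADDVV.D V3,V1,V2", "add"),
       ("SV V3,Rx", "load/store")]
print(partition_into_convoys(seq))
# -> [['LV V1,Rx', 'ADDVV.D V3,V1,V2'], ['SV V3,Rx']]
```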
To turn convoys into execution time, we need a timing metric to estimate the time for a convoy. It is called a chime, which is simply the unit of time taken to execute one convoy. Thus,
a vector sequence that consists of m convoys executes in m chimes; for a vector length of n ,
for VMIPS this is approximately m × n clock cycles. The chime approximation ignores some
processor-specific overheads, many of which are dependent on vector length. Hence, measuring time in chimes is a better approximation for long vectors than for short ones. We will use
the chime measurement, rather than clock cycles per result, to indicate explicitly that we are
ignoring certain overheads.
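The chime approximation reduces to simple arithmetic. A sketch of the estimate, with an illustrative vector length of my own choosing:

```python
# Chime timing model from the text: a sequence of m convoys operating on
# vectors of length n takes roughly m * n clock cycles, since each convoy
# produces one result per element and overheads are ignored.
def chime_time(m_convoys, n_length):
    return m_convoys * n_length

# Example values (illustrative): 3 convoys, 64-element vectors.
print(chime_time(3, 64))   # -> 192 cycles, i.e. 3 chimes of 64 cycles each
```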
If we know the number of convoys in a vector sequence, we know the execution time in
chimes. One source of overhead ignored in measuring chimes is any limitation on initiating
multiple vector instructions in a single clock cycle. If only one vector instruction can be ini-
tiated in a clock cycle (the reality in most vector processors), the chime count will underes-
timate the actual execution time of a convoy. Because the length of vectors is typically much
greater than the number of instructions in the convoy, we will simply assume that the convoy
executes in one chime.
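The claim that long vectors justify the one-chime assumption can be quantified. A sketch, under the single-issue assumption stated above: a convoy of k instructions needs roughly n + (k − 1) cycles rather than n, so the error the chime count ignores shrinks with vector length.

```python
# Sketch: relative error of the chime approximation when only one vector
# instruction can issue per clock. A k-instruction convoy over n-element
# vectors takes about n + (k - 1) cycles; the chime model charges only n.
def relative_issue_overhead(k_instructions, n_length):
    return (k_instructions - 1) / n_length

# Illustrative numbers: a 2-instruction convoy on 64-element vectors.
print(relative_issue_overhead(2, 64))   # -> 0.015625, i.e. about 1.6% error
```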
Example
Show how the following code sequence lays out in convoys, assuming a single
copy of each vector functional unit:
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar multiply
LV V3,Ry ;load vector Y
ADDVV.D V4,V2,V3 ;add two vectors
SV V4,Ry ;store the sum
How many chimes will this vector sequence take? How many cycles per
FLOP (floating-point operation) are needed, ignoring vector instruction issue
overhead?
Answer
The first convoy starts with the first LV instruction. The MULVS.D is dependent on
the first LV , but chaining allows it to be in the same convoy.
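Applying the structural-hazard rule mechanically completes the count. This sketch is my own illustration (the functional-unit labels are not from the text), but it follows the rules stated above: each LV/SV needs the single load/store unit, forcing a new convoy, while chaining keeps each dependent arithmetic operation with its producer.

```python
# Derive the convoy count and cycles per FLOP for the example sequence.
seq = [("LV V1,Rx",          "load/store"),
       ("MULVS.D V2,V1,F0",  "multiply"),
       ("LV V3,Ry",          "load/store"),
       ("ADDVV.D V4,V2,V3",  "add"),
       ("SV V4,Ry",          "load/store")]

convoys, current, busy = [], [], set()
for name, unit in seq:
    if unit in busy:                      # structural hazard -> new convoy
        convoys.append(current)
        current, busy = [], set()
    current.append(name)
    busy.add(unit)
convoys.append(current)

chimes = len(convoys)                     # one chime per convoy
flops_per_element = 2                     # one multiply + one add per element
print(chimes)                             # -> 3 convoys, so 3 chimes
print(chimes / flops_per_element)         # -> 1.5 cycles per FLOP
```

The three convoys come out as {LV, MULVS.D}, {LV, ADDVV.D}, and {SV}: 3 chimes for 2n FLOPs gives 1.5 clock cycles per FLOP, ignoring issue overhead.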