were present, the instructions would need to be serialized and initiated in different convoys.
To keep the analysis simple, we assume that a convoy of instructions must complete execution
before any other instructions (scalar or vector) can begin execution.
It might seem that in addition to vector instruction sequences with structural hazards, se-
quences with read-after-write dependency hazards should also be in separate convoys, but
chaining allows them to be in the same convoy.
Chaining allows a vector operation to start as soon as the individual elements of its vector
source operand become available: The results from the first functional unit in the chain are
“forwarded” to the second functional unit. In practice, we often implement chaining by allowing the processor to read and write a particular vector register at the same time, albeit to different elements. Early implementations of chaining worked just like forwarding in scalar pipelining, but this restricted the timing of the source and destination instructions in the chain. Recent implementations use flexible chaining, which allows a vector instruction to chain to essentially
any other active vector instruction, assuming that we don't generate a structural hazard. All
modern vector architectures support flexible chaining, which we assume in this chapter.
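The convoy rules above (structural hazards force a new convoy; RAW dependences do not, because of flexible chaining) can be expressed mechanically. The following is a minimal sketch of that partitioning logic, not a model of any real machine; the instruction and functional-unit labels are my own illustration.

```python
# Sketch: partition vector instructions into convoys.
# Rule (with flexible chaining): an instruction starts a new convoy only
# when it needs a functional unit already in use in the current convoy;
# RAW dependences alone are hidden by chaining.

def partition_into_convoys(instructions):
    """instructions: list of (name, functional_unit) pairs, in program order."""
    convoys = []
    current, busy_units = [], set()
    for name, unit in instructions:
        if unit in busy_units:            # structural hazard: close this convoy
            convoys.append(current)
            current, busy_units = [], set()
        current.append(name)
        busy_units.add(unit)
    if current:
        convoys.append(current)
    return convoys

# A load chained into an add stays in one convoy; the store must wait
# for the load/store unit, so it starts a second convoy.
seq = [("LV V1,Rx", "load/store"),
       ("ADDVV.D V3,V1,V2", "add"),
       ("SV V3,Rx", "load/store")]
print(partition_into_convoys(seq))
# -> [['LV V1,Rx', 'ADDVV.D V3,V1,V2'], ['SV V3,Rx']]
```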
To turn convoys into execution time, we need a timing metric to estimate the time for a convoy. It is called a chime, which is simply the unit of time taken to execute one convoy. Thus,
a vector sequence that consists of m convoys executes in m chimes; for a vector length of n ,
for VMIPS this is approximately m × n clock cycles. The chime approximation ignores some
processor-specific overheads, many of which are dependent on vector length. Hence, measuring time in chimes is a better approximation for long vectors than for short ones. We will use
the chime measurement, rather than clock cycles per result, to indicate explicitly that we are
ignoring certain overheads.
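The chime approximation reduces to simple arithmetic. A sketch of the estimate, with an illustrative vector length of my own choosing:

```python
# Chime timing model from the text: a sequence of m convoys operating on
# vectors of length n takes roughly m * n clock cycles, since each convoy
# produces one result per element and overheads are ignored.
def chime_time(m_convoys, n_length):
    return m_convoys * n_length

# Example values (illustrative): 3 convoys, 64-element vectors.
print(chime_time(3, 64))   # -> 192 cycles, i.e. 3 chimes of 64 cycles each
```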
If we know the number of convoys in a vector sequence, we know the execution time in
chimes. One source of overhead ignored in measuring chimes is any limitation on initiating
multiple vector instructions in a single clock cycle. If only one vector instruction can be ini-
tiated in a clock cycle (the reality in most vector processors), the chime count will underes-
timate the actual execution time of a convoy. Because the length of vectors is typically much
greater than the number of instructions in the convoy, we will simply assume that the convoy
executes in one chime.
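The claim that long vectors justify the one-chime assumption can be quantified. A sketch, under the single-issue assumption stated above: a convoy of k instructions needs roughly n + (k − 1) cycles rather than n, so the error the chime count ignores shrinks with vector length.

```python
# Sketch: relative error of the chime approximation when only one vector
# instruction can issue per clock. A k-instruction convoy over n-element
# vectors takes about n + (k - 1) cycles; the chime model charges only n.
def relative_issue_overhead(k_instructions, n_length):
    return (k_instructions - 1) / n_length

# Illustrative numbers: a 2-instruction convoy on 64-element vectors.
print(relative_issue_overhead(2, 64))   # -> 0.015625, i.e. about 1.6% error
```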
Example
Show how the following code sequence lays out in convoys, assuming a single
copy of each vector functional unit:
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar multiply
LV V3,Ry ;load vector Y
ADDVV.D V4,V2,V3 ;add two vectors
SV V4,Ry ;store the sum
How many chimes will this vector sequence take? How many cycles per
FLOP (floating-point operation) are needed, ignoring vector instruction issue
overhead?
Answer
The first convoy starts with the first LV instruction. The MULVS.D is dependent on
the first LV , but chaining allows it to be in the same convoy.
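Applying the structural-hazard rule mechanically completes the count. This sketch is my own illustration (the functional-unit labels are not from the text), but it follows the rules stated above: each LV/SV needs the single load/store unit, forcing a new convoy, while chaining keeps each dependent arithmetic operation with its producer.

```python
# Derive the convoy count and cycles per FLOP for the example sequence.
seq = [("LV V1,Rx",          "load/store"),
       ("MULVS.D V2,V1,F0",  "multiply"),
       ("LV V3,Ry",          "load/store"),
       ("ADDVV.D V4,V2,V3",  "add"),
       ("SV V4,Ry",          "load/store")]

convoys, current, busy = [], [], set()
for name, unit in seq:
    if unit in busy:                      # structural hazard -> new convoy
        convoys.append(current)
        current, busy = [], set()
    current.append(name)
    busy.add(unit)
convoys.append(current)

chimes = len(convoys)                     # one chime per convoy
flops_per_element = 2                     # one multiply + one add per element
print(chimes)                             # -> 3 convoys, so 3 chimes
print(chimes / flops_per_element)         # -> 1.5 cycles per FLOP
```

The three convoys come out as {LV, MULVS.D}, {LV, ADDVV.D}, and {SV}: 3 chimes for 2n FLOPs gives 1.5 clock cycles per FLOP, ignoring issue overhead.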