The second LV instruction must be in a separate convoy since there is a structural
hazard on the load/store unit for the prior LV instruction. The ADDVV.D is dependent
on the second LV, but it can again be in the same convoy via chaining. Finally, the SV
has a structural hazard on the LV in the second convoy, so it must go in the third
convoy. This analysis leads to the following layout of vector instructions into convoys:
1. LV       MULVS.D
2. LV       ADDVV.D
3. SV
The sequence requires three convoys. Since the sequence takes three chimes and
there are two floating-point operations per result, the number of cycles per FLOP
is 1.5 (ignoring any vector instruction issue overhead). Note that, although we
allow the LV and MULVS.D both to execute in the first convoy, most vector machines
will take two clock cycles to initiate the instructions.
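The convoy layout above can be reproduced with a small greedy grouping routine. The following is only a sketch under simplifying assumptions: the `UNIT` table and the function name `group_into_convoys` are illustrative and not part of VMIPS, each opcode is assumed to occupy exactly one functional unit, and data dependences are assumed never to force a new convoy because chaining is available. Only a structural hazard on a functional unit starts a new convoy.

```python
# Functional unit occupied by each opcode (an assumed mapping for this sketch).
UNIT = {
    "LV": "load/store",
    "SV": "load/store",
    "MULVS.D": "fp-multiply",
    "ADDVV.D": "fp-add",
}

def group_into_convoys(opcodes):
    """Greedily pack instructions into convoys, in program order.

    A new convoy starts only when the next instruction needs a functional
    unit already used in the current convoy (a structural hazard).
    Data dependences are ignored because chaining is assumed.
    """
    convoys = []
    current = []
    busy_units = set()
    for op in opcodes:
        unit = UNIT[op]
        if unit in busy_units:
            # Structural hazard: close the current convoy and start another.
            convoys.append(current)
            current = []
            busy_units = set()
        current.append(op)
        busy_units.add(unit)
    if current:
        convoys.append(current)
    return convoys

convoys = group_into_convoys(["LV", "MULVS.D", "LV", "ADDVV.D", "SV"])
for number, convoy in enumerate(convoys, start=1):
    print(number, " ".join(convoy))
```

For the five-instruction sequence discussed here, the routine yields three convoys, matching the layout in the text.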
This example shows that the chime approximation is reasonably accurate for
long vectors. For example, for 64-element vectors, the time in chimes is 3, so the
sequence would take about 64 × 3 or 192 clock cycles. The overhead of issuing
convoys in two separate clock cycles would be small.
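The arithmetic behind these figures can be written out as a short calculation. This is a minimal sketch of the chime approximation only; the function name `chime_estimate` and its parameters are hypothetical, and the model deliberately ignores issue overhead and start-up time.

```python
def chime_estimate(num_convoys, vector_length, flops_per_result):
    """Estimate execution time with the chime approximation.

    Each convoy is assumed to take one chime, and one chime is
    approximated as vector_length clock cycles.
    """
    cycles = num_convoys * vector_length
    total_flops = vector_length * flops_per_result
    cycles_per_flop = cycles / total_flops
    return cycles, cycles_per_flop

# Three convoys, 64-element vectors, two floating-point operations per result.
cycles, cycles_per_flop = chime_estimate(3, 64, 2)
print(cycles, cycles_per_flop)
```

With three convoys, 64-element vectors, and two floating-point operations per result, this gives 192 clock cycles and 1.5 cycles per FLOP, the values quoted above.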
Another source of overhead is far more significant than the issue limitation. The most
important source of overhead ignored by the chime model is vector start-up time. The start-up
time is principally determined by the pipelining latency of the vector functional unit. For
VMIPS, we will use the same pipeline depths as the Cray-1, although latencies in more modern
processors have tended to increase, especially for vector loads. All functional units are fully
pipelined. The pipeline depths are 6 clock cycles for floating-point add, 7 for floating-point
multiply, 20 for floating-point divide, and 12 for vector load.
Given these vector basics, the next several subsections will give optimizations that either
improve the performance or increase the types of programs that can run well on vector
architectures. In particular, they will answer the questions:
■ How can a vector processor execute a single vector faster than one element per clock cycle?
Multiple elements per clock cycle improve performance.
■ How does a vector processor handle programs where the vector lengths are not the same as
the length of the vector register (64 for VMIPS)? Since most application vectors don't match
the architecture vector length, we need an efficient solution to this common case.
■ What happens when there is an IF statement inside the code to be vectorized? More code
can vectorize if we can efficiently handle conditional statements.
■ What does a vector processor need from the memory system? Without sufficient memory
bandwidth, vector execution can be futile.
■ How does a vector processor handle multidimensional matrices? This popular data
structure must vectorize for vector architectures to do well.
■ How does a vector processor handle sparse matrices? This popular data structure must
vectorize also.