Hardware Reference
In-Depth Information
Register Read
Forwarding
Register Read
E1
E2
FDS
FLS
FPOLY
Short
E3
Main
E4
E5
Register Write
E6
E7
Register Write
LS
FE
Fig. 3.34
Arithmetic execution pipeline of SH-X FPU
We decided the vector instructions to be standard ones of the SH-X, which were
optional ones of the SH-4, and the SH-X merged the vector hardware and optimized
the merged hardware. Then the latencies of the most instructions became less than 1.5
times of the SH-4, and all the instructions could use the vector hardware if necessary.
There were weak requirements of high-speed double-precision operations when the
SH-4 was developed and chose the hardware emulation to implement them. However,
they could use the vector hardware and became faster mainly with the wider read/
write register ports and the more multipliers in the SH-X implementation.
Figure 3.34 illustrates the FPU arithmetic execution pipeline. With the delayed
execution architecture, the register-operand read and forwarding are done at the E1
stage, and the arithmetic operation starts at E2. The short arithmetic pipeline treats
three-cycle-latency instructions. All the arithmetic pipelines share one register write
port to reduce the number of ports. There are four forwarding source points to provide
the specified latencies for any cycle distance of the define-and-use instructions. The
FDS pipeline is occupied by 13/28 cycles to execute a single/double FDIV or FSQRT,
and these instructions cannot be issued frequently. The FPOLY pipeline is three cycles
long and is occupied three or five times to execute an FSRRA or FSCA instruction.
Therefore, the third E4 stage and E6 stage of the main pipeline are synchronized for
the FSRRA, and the FPOLY pipeline output merges with the main pipeline at this
point. The FSCA produce two outputs, and the first output is produced at the same
timing of the FSRRA, and the second one is produced two cycles later, and the main
pipeline is occupied for three cycles, although the second cycle is not used. The
FSRRA and FSCA are implemented by calculating the cubic polynomials of the prop-
erly divided periods. The width of the third order term is eight bits, which adds only a
small area overhead, while enhancing accuracy and reducing latency.
Figure 3.35 illustrates the structure of the main FPU pipeline. There are four
single-precision multiplier arrays at E2 to execute FIPR and FTRV and to emulate
 
Search WWH ::




Custom Search