Processor Cores - Heterogeneous Multicore Processor Technologies for Embedded Systems - page 60


[Figure: pipeline diagram with stages E1 to E7; labels: Register Read, Forwarding, FDS, FLS, FPOLY, Short, Main, LS, FE, Register Write]

Fig. 3.34 Arithmetic execution pipeline of SH-X FPU

We made the vector instructions, which were optional in the SH-4, standard in the SH-X, and the SH-X merged the vector hardware and optimized the merged hardware. As a result, the latencies of most instructions became less than 1.5 times those of the SH-4, and all the instructions could use the vector hardware when necessary. When the SH-4 was developed, there was only a weak requirement for high-speed double-precision operations, so hardware emulation was chosen to implement them. In the SH-X implementation, however, these operations could use the vector hardware and became faster, mainly because of the wider register read/write ports and the larger number of multipliers.

Figure 3.34 illustrates the FPU arithmetic execution pipeline. With the delayed execution architecture, the register-operand read and forwarding are done at the E1 stage, and the arithmetic operation starts at E2. The short arithmetic pipeline handles instructions with a three-cycle latency. All the arithmetic pipelines share one register write port to reduce the number of ports. Four forwarding source points provide the specified latencies for any cycle distance between an instruction that defines a register and an instruction that uses it. The FDS pipeline is occupied for 13 or 28 cycles to execute a single- or double-precision FDIV or FSQRT, so these instructions cannot be issued frequently. The FPOLY pipeline is three cycles long and is occupied three times to execute an FSRRA instruction or five times to execute an FSCA instruction. For the FSRRA, therefore, the third E4 stage of the FPOLY pipeline is synchronized with the E6 stage of the main pipeline, and the FPOLY pipeline output merges with the main pipeline at this point. The FSCA produces two outputs: the first is produced at the same timing as the FSRRA output, and the second is produced two cycles later. The main pipeline is therefore occupied for three cycles, although the second of these cycles is not used. The FSRRA and FSCA are implemented by calculating cubic polynomials over properly divided periods. The third-order term is eight bits wide, which adds only a small area overhead while enhancing accuracy and reducing latency.
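The FPOLY timing described above can be checked with a small model. The following is only a sketch under stated assumptions, not the SH-X design itself: it assumes that the FPOLY stages are E2 to E4 (arithmetic starts at E2 and the pipeline is three cycles long) and that successive FPOLY passes of one instruction enter the pipeline on consecutive cycles. The function and constant names are hypothetical.

```python
# Hypothetical timing model of the FPOLY pipeline described in the text.
# Assumptions (not stated explicitly in the source):
#   - FPOLY occupies stages E2, E3, E4 (arithmetic starts at E2, 3 cycles long)
#   - each additional pass enters E2 one cycle after the previous pass

FPOLY_FIRST_STAGE = 2   # E2
FPOLY_LENGTH = 3        # E2, E3, E4

# Number of FPOLY passes per instruction, as given in the text.
FPOLY_PASSES = {"FSRRA": 3, "FSCA": 5}


def aligned_main_stage(pass_number: int) -> int:
    """Main-pipeline stage that coincides with the E4 stage of a given pass.

    Pass 1 reaches E4 when the instruction itself is at E4; every later
    pass is delayed by one more cycle under the assumption above.
    """
    if pass_number < 1:
        raise ValueError("pass_number starts at 1")
    last_fpoly_stage = FPOLY_FIRST_STAGE + FPOLY_LENGTH - 1  # E4
    return last_fpoly_stage + (pass_number - 1)


def fsca_output_stages() -> tuple:
    """Stages of the two FSCA outputs: after pass 3 and after pass 5."""
    return aligned_main_stage(3), aligned_main_stage(FPOLY_PASSES["FSCA"])


if __name__ == "__main__":
    fsrra = aligned_main_stage(FPOLY_PASSES["FSRRA"])
    first, second = fsca_output_stages()
    print(f"FSRRA merges with the main pipeline at E{fsrra}")
    print(f"FSCA outputs are {second - first} cycles apart")
```

Under these assumptions, the third pass reaches E4 when the main pipeline is at E6, which matches the merge point given for the FSRRA, and the fifth pass finishes two cycles later, which matches the gap between the two FSCA outputs.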

Figure 3.35 illustrates the structure of the main FPU pipeline. There are four single-precision multiplier arrays at E2 to execute FIPR and FTRV and to emulate
