Processor Cores - Heterogeneous Multicore Processor Technologies for Embedded Systems

Hardware Reference

In-Depth Information

Fig. 3.35 Main pipeline of

SH-X FPU

Exponent

Difference

Exponent

Adder

Multiplier

Array

Multiplier

Array

Multiplier

Array

Multiplier

Array

E2

Aligner

E3

Reduction Array

Carry Propagate

Adder (CPA)

Leading Non-zero

(LNZ) Detector

E4

Exponent

Normalizer

E5

Mantissa Normalizer

Rounder

E6

double-precision multiplication. Their total area is less than that of a double-precision

multiplier array. The calculation of exponent differences is also done at E2 for align-

ment operations by four aligners at E3. The four aligners align eight terms consisting

of four sets of sum and carry pairs of four products generated by the four multiplier

arrays, and a reduction array reduces the aligned eight terms to two at E3. The

exponent value before normalization is also calculated by an exponent adder at E3.

A carry-propagate adder (CPA) adds two terms from the reduction array, and a lead-

ing nonzero (LNZ) detector searches the LNZ position of the absolute value of the

CPA result from the two CPA inputs precisely and with the same speed as the CPA

at E4. Therefore, the result of the CPA can be normalized immediately after the CPA

operation with no correction of position errors, which is often necessary when using

a conventional 1-bit error LNZ detector. Mantissa and exponent normalizers nor-

malize the CPA and exponent-adder outputs at E5 controlled by the LNZ detector

output. Finally, the rounder rounds the normalized results into the ANSI/IEEE 754

format. The extra hardware required for the special FPU instructions of the FIPR,

FTRV, FSRRA, and FSCA is about 30% of the original FPU hardware, and the FPU

area is about 10-20% of the processor core depending on the size of the first and

second on-chip memories. Therefore, the extra hardware is about 3-6% of the

processor core.

The SH-4 used the FPU multiplier for integer multiplications, so the multiplier could

calculate a 32-by-32 multiplication. Therefore, the double-precision multiplication

could be divided into four parts. On the other hand, the SH-X separated the integer and

FPU multipliers to make the FPU optional. Then the FPU had four 24-by-24 multipliers

for the double-precision FMUL emulation. Since the double-precision mantissa width

was more than twice of the single-precision one, we had to divide a multiplication into

nine parts. Then we need three cycles to emulate the nine partial multiplications by four

multipliers.

Figure 3.36 illustrates the flow of the emulation. At the first step, a lower-by-lower

product is produced, and its lower 23 bits are added by the CPA. Then the CPA

output is ORed to generate a sticky bit. At the second step, four products of middle-

by-lower, lower-by-middle, upper-by-lower, and lower-by-upper are produced and

Heterogeneous Multicore Processor Technologies for Embedded Systems

Search WWH ::

Custom Search

Home