Hardware Reference
In-Depth Information
Fig. 3.35 Main pipeline of
SH-X FPU
Exponent
Difference
Exponent
Adder
Multiplier
Array
Multiplier
Array
Multiplier
Array
Multiplier
Array
E2
Aligner
Aligner
Aligner
Aligner
E3
Reduction Array
Carry Propagate
Adder (CPA)
Leading Non-zero
(LNZ) Detector
E4
Exponent
Normalizer
E5
Mantissa Normalizer
Rounder
E6
double-precision multiplication. Their total area is less than that of a double-precision
multiplier array. The calculation of exponent differences is also done at E2 for align-
ment operations by four aligners at E3. The four aligners align eight terms consisting
of four sets of sum and carry pairs of four products generated by the four multiplier
arrays, and a reduction array reduces the aligned eight terms to two at E3. The
exponent value before normalization is also calculated by an exponent adder at E3.
A carry-propagate adder (CPA) adds two terms from the reduction array, and a lead-
ing nonzero (LNZ) detector searches the LNZ position of the absolute value of the
CPA result from the two CPA inputs precisely and with the same speed as the CPA
at E4. Therefore, the result of the CPA can be normalized immediately after the CPA
operation with no correction of position errors, which is often necessary when using
a conventional 1-bit error LNZ detector. Mantissa and exponent normalizers nor-
malize the CPA and exponent-adder outputs at E5 controlled by the LNZ detector
output. Finally, the rounder rounds the normalized results into the ANSI/IEEE 754
format. The extra hardware required for the special FPU instructions of the FIPR,
FTRV, FSRRA, and FSCA is about 30% of the original FPU hardware, and the FPU
area is about 10-20% of the processor core depending on the size of the first and
second on-chip memories. Therefore, the extra hardware is about 3-6% of the
processor core.
The SH-4 used the FPU multiplier for integer multiplications, so the multiplier could
calculate a 32-by-32 multiplication. Therefore, the double-precision multiplication
could be divided into four parts. On the other hand, the SH-X separated the integer and
FPU multipliers to make the FPU optional. Then the FPU had four 24-by-24 multipliers
for the double-precision FMUL emulation. Since the double-precision mantissa width
was more than twice of the single-precision one, we had to divide a multiplication into
nine parts. Then we need three cycles to emulate the nine partial multiplications by four
multipliers.
Figure 3.36 illustrates the flow of the emulation. At the first step, a lower-by-lower
product is produced, and its lower 23 bits are added by the CPA. Then the CPA
output is ORed to generate a sticky bit. At the second step, four products of middle-
by-lower, lower-by-middle, upper-by-lower, and lower-by-upper are produced and
 
Search WWH ::




Custom Search