Hardware Reference
In-Depth Information
Fig. 3.24
Pipeline stall after E0 stage use
(Pipeline diagram: FADD, FSUB, FIPR (Vector), and FMUL across the ID, E0, EX, MA, and WB stages, with a 1 cycle stall marked.)
Fig. 3.25
Out-of-order completion of single-precision FDIV
(Pipeline diagram: FDIV, FADD, FSUB, FMUL, and the FDIV post process across the ID, E1, E2, and WB stages, with the FDS block occupied for nine cycles.)
The FDS block handles FDIV and FSQRT. The SH-4 adopts an SRT method with
carry-save adders, and the FDS block generates three bits of the quotient or
square-root value per cycle. Single- and double-precision mantissas have 24 and
53 bits, respectively, and two extra bits, the guard and round bits, are required to
generate the final result. The FDS block therefore takes 9 and 19 cycles to generate
the mantissas, and the pitches are 10 and 23 for the single- and double-precision
FDIVs, respectively. The differences come from some extra cycles before and after
the mantissa generation. The pitches of the FSQRTs are one cycle shorter than those
of the FDIVs, with a special treatment at the beginning. These pitches are much
longer than those of the other instructions and degrade performance even though
FDIV and FSQRT occur much less frequently than the others. For example, if one of
ten instructions is a single-precision FDIV and the pitches of the other nine
instructions are one, the total pitch is 19 (9 × 1 + 10). Therefore, out-of-order
completion of the FDIV and FSQRT is adopted to hide their long pitches, so that
only the FDS block is occupied for a long time. Figure 3.25 illustrates the
out-of-order completion of a single-precision FDIV.
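The cycle counts quoted above can be checked with a short calculation. The following is a minimal sketch, not a model of the actual hardware: the function and constant names are hypothetical, and it assumes the mantissa-generation time is simply the mantissa width plus the guard and round bits, divided by three bits per cycle and rounded up.

```python
# Hypothetical helper names; the figures (3 bits/cycle, 24/53-bit mantissas,
# 2 extra bits, pitches of 10 and 23) are taken from the text.
BITS_PER_CYCLE = 3   # quotient or square-root bits produced per cycle
EXTRA_BITS = 2       # guard and round bits

FDIV_PITCH = {"single": 10, "double": 23}  # pitches stated in the text


def mantissa_cycles(mantissa_bits: int) -> int:
    """Cycles needed to produce the mantissa plus guard/round bits."""
    total_bits = mantissa_bits + EXTRA_BITS
    # Integer ceiling division, avoiding floating point.
    return -(-total_bits // BITS_PER_CYCLE)


def total_pitch(pitches: list[int]) -> int:
    """Total pitch of an instruction sequence with in-order completion."""
    return sum(pitches)


single_cycles = mantissa_cycles(24)   # 26 bits at 3 bits/cycle
double_cycles = mantissa_cycles(53)   # 55 bits at 3 bits/cycle

# Nine instructions with a pitch of one plus one single-precision FDIV.
mix = [1] * 9 + [FDIV_PITCH["single"]]

print(single_cycles, double_cycles, total_pitch(mix))
```

Under these assumptions the sketch reproduces the 9 and 19 mantissa-generation cycles and the total pitch of 19 for the ten-instruction example.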
The single-precision FDIV and FSQRT use the MAIN block for two cycles at the
beginning and end of the operation to minimize the hardware dedicated to the
FDIV and FSQRT. The double-precision ones use it for five cycles: two at the
beginning and three at the end. The MAIN block is released to the following
instructions for the other cycles of the FDIV and FSQRT.
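A rough estimate of how many cycles the MAIN block stays available during one FDIV follows from these numbers. This is a sketch under a stated assumption, not a figure from the source: it treats the pitch (10 or 23 cycles) as the full duration of the operation and subtracts the cycles in which the FDIV holds the MAIN block. The names are hypothetical.

```python
# Figures from the text: FDIV pitches and MAIN-block cycles per precision.
FDIV_PITCH = {"single": 10, "double": 23}
MAIN_CYCLES = {"single": 2, "double": 5}


def main_free_cycles(precision: str) -> int:
    """Cycles of one FDIV pitch in which the MAIN block is released.

    Assumption (not stated in the source): the MAIN block is free in
    every cycle of the pitch that the FDIV itself does not occupy.
    """
    return FDIV_PITCH[precision] - MAIN_CYCLES[precision]


for precision in ("single", "double"):
    print(precision, main_free_cycles(precision))
```

With that assumption, the MAIN block would be free for 8 of the 10 cycles of a single-precision FDIV and 18 of the 23 cycles of a double-precision one.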
The double-precision instructions other than the FDIV and FSQRT are emulated
by the hardware for single-precision instructions, with a small amount of additional
hardware for the emulation. Since the SH-4 merged an integer multiplier into the FPU,
it supports 32-bit multiplication and 64-bit addition for an integer multiply-and-accumulate
 