The double-precision implementation is explained later, but it is faster than that
of the SH-4, so the load/store/transfer instructions also had to become faster to
keep the performance balanced. Therefore, a double-precision mode was defined for
the FMOV, in addition to the normal and pair modes of single precision, by using the
FPSCR.PR and SZ bits so that the FMOV can treat double-precision data. Further, a
floating-point precision change instruction (FPCHG) was defined for fast
precision-mode changes, in addition to the FRCHG and FSCHG described in Sect. 3.1.5.1.
3.1.6.2 High-Frequency Implementation of the SH-X FPU
The SH-X FPU achieved 1.4 times the frequency of the SH-4 in the same process while
maintaining or enhancing the cycle performance. Table 3.8 shows the pitches and
latencies of the FE-category instructions of the SH-3E, SH-4, and SH-X. As for the
SH-X, the simple single-precision instructions FADD, FSUB, FLOAT, and FTRC
have three-cycle latencies. Both single- and double-precision FCMPs have
two-cycle latencies. The other single-precision instructions FMUL, FMAC, and FIPR, and
the double-precision instructions except FMUL, FCMP, FDIV, and FSQRT, have
five-cycle latencies. All the above instructions have one-cycle pitches. The FTRV
consists of four FIPR-like operations, resulting in a four-cycle pitch and an
eight-cycle latency. The FDIV and FSQRT are out-of-order completion instructions with
two-cycle pitches: the first cycle initiates a special-resource operation, and the
last cycle performs the postprocessing of normalizing and rounding the result. Their
special-hardware pitches, shown in parentheses in the table, are about half of the
mantissa widths, and their latencies are four cycles more than the special-hardware
pitches. The FSRRA has a one-cycle pitch, a three-cycle pitch of the special hardware,
and a five-cycle latency. The FSCA has a three-cycle pitch, a five-cycle pitch of the
special hardware, and a seven-cycle latency. The double-precision FMUL has a
three-cycle pitch and a seven-cycle latency.
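The pitch and latency figures above can be turned into a rough timing estimate. The following is a minimal sketch, assuming a stream of independent instructions of one kind issued back to back with no stalls; the dictionary `SHX_FE_TIMING` and the helper `cycles_to_finish` are hypothetical names introduced here for illustration, and the table holds only the values stated in the paragraph above.

```python
# (pitch, latency) in cycles for SH-X FE-category instructions, transcribed
# from the prose above. Single precision unless noted. Names are illustrative.
SHX_FE_TIMING = {
    "FADD": (1, 3), "FSUB": (1, 3), "FLOAT": (1, 3), "FTRC": (1, 3),
    "FCMP": (1, 2),
    "FMUL": (1, 5), "FMAC": (1, 5), "FIPR": (1, 5),
    "FTRV": (4, 8),
    "FSRRA": (1, 5),
    "FSCA": (3, 7),
    "FMUL_double": (3, 7),
}


def cycles_to_finish(mnemonic, count):
    """Cycles until the last of `count` independent instructions completes.

    Assumes the instructions are issued back to back (one every `pitch`
    cycles) and that nothing else competes for the pipeline.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    pitch, latency = SHX_FE_TIMING[mnemonic]
    return (count - 1) * pitch + latency


if __name__ == "__main__":
    for name in ("FADD", "FIPR", "FTRV", "FSCA"):
        print(name, cycles_to_finish(name, 4))
```

Under this model, four back-to-back FIPRs finish in 3 + 5 = 8 cycles, which is consistent with the eight-cycle latency quoted for the FTRV and its four FIPR-like operations.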
Multiply-accumulate (MAC) is one of the most frequent operations in intensive
computing applications. The use of four-way SIMD would achieve the same
throughput as the FIPR, but the latency would be longer, and the register file would
have to be larger. Figure 3.31 illustrates an example of the differences according to
the pitches and latencies of the FE-category SH-X instructions shown in Table 3.8. In
this example, each box shows an operation issue slot. Since FMUL and FMAC have
five-cycle latencies, we must issue 20 independent operations for peak throughput
in the case of four-way SIMD. The result is available 20 cycles after the FMUL
issue. On the other hand, five independent operations are enough to get the peak
throughput of a program using FIPRs. Therefore, the FIPR requires one-quarter of the
program parallelism and one-quarter of the latency of the four-way SIMD approach.
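The arithmetic behind the 20-versus-5 comparison can be sketched as follows. This is a simplified model, not the book's own formulation: it assumes a four-element inner product is computed on the SIMD side as one FMUL followed by three dependent FMACs, that every instruction has a one-cycle pitch, and that the number of independent operations needed equals the instructions in flight times the SIMD lanes. The function name `peak_throughput_needs` and its parameters are hypothetical.

```python
def peak_throughput_needs(chain_steps, latency, lanes, pitch=1):
    """Return (result_latency, independent_ops) for a dependent chain.

    chain_steps: dependent instructions needed for one result
    latency:     latency of each instruction in cycles
    lanes:       operations carried by one instruction (4 for four-way SIMD)
    pitch:       issue interval of the instruction in cycles
    """
    # Each step must wait for the previous one, so latencies add up.
    result_latency = chain_steps * latency
    # To issue every `pitch` cycles, `latency // pitch` instructions must be
    # in flight at once, each carrying `lanes` independent operations.
    independent_ops = (latency // pitch) * lanes
    return result_latency, independent_ops


# Four-way SIMD: FMUL + 3 dependent FMACs, five-cycle latency each.
simd = peak_throughput_needs(chain_steps=4, latency=5, lanes=4)  # (20, 20)
# FIPR: one instruction produces the whole inner product.
fipr = peak_throughput_needs(chain_steps=1, latency=5, lanes=1)  # (5, 5)

if __name__ == "__main__":
    print("SIMD latency, parallelism:", simd)
    print("FIPR latency, parallelism:", fipr)
    print("FIPR / SIMD ratios:", fipr[0] / simd[0], fipr[1] / simd[1])
```

Both ratios come out to one-quarter, matching the conclusion of the paragraph above.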
Figure 3.32 compares the pitch and latency of an FSRRA with those of the equivalent
sequence of an FSQRT and an FDIV, according to Table 3.8. Each of the FSQRT
and FDIV occupies the main FPU for 2 cycles and the special resources for 13 cycles
and takes 17 cycles to produce its result, so the final result is available 34 cycles
after the issue of the FSQRT. In contrast, the pitch and latency of the FSRRA are one and