Processor Cores - Heterogeneous Multicore Processor Technologies for Embedded Systems

The double-precision implementation is explained later. It was made faster than that of the SH-4, so the load/store/transfer instructions also had to become faster to keep the performance balanced. Therefore, in addition to the normal and pair modes for single precision, a double-precision mode was defined for the FMOV, selected with the FPSCR.PR and SZ bits, so that the FMOV can handle double-precision data. Further, a floating-point precision change instruction (FPCHG) was defined for fast precision-mode changes, alongside the FRCHG and FSCHG described in Sect. 3.1.5.1.

3.1.6.2 High-Frequency Implementation of the SH-X FPU

The SH-X FPU achieved 1.4 times the frequency of the SH-4 in the same process while maintaining or enhancing the cycle performance. Table 3.8 shows the pitches and latencies of the FE-category instructions of the SH-3E, SH-4, and SH-X. On the SH-X, the simple single-precision instructions FADD, FSUB, FLOAT, and FTRC have three-cycle latencies. Both the single- and double-precision FCMPs have two-cycle latencies. The other single-precision instructions FMUL, FMAC, and FIPR, and the double-precision instructions except FMUL, FCMP, FDIV, and FSQRT, have five-cycle latencies. All of the above instructions have one-cycle pitches. The FTRV consists of four FIPR-like operations, resulting in a four-cycle pitch and an eight-cycle latency.

The FDIV and FSQRT are out-of-order completion instructions with two-cycle pitches: the first cycle initiates a special-resource operation, and the last cycle performs the postprocessing that normalizes and rounds the result. The pitches of their special hardware, shown in parentheses in Table 3.8, are about half of the mantissa widths, and their latencies are four cycles longer than the special-hardware pitches. The FSRRA has a one-cycle pitch, a three-cycle special-hardware pitch, and a five-cycle latency. The FSCA has a three-cycle pitch, a five-cycle special-hardware pitch, and a seven-cycle latency. The double-precision FMUL has a three-cycle pitch and a seven-cycle latency.
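The timing rules above can be checked with a small calculation. The following Python sketch is illustrative only: the function names are hypothetical, and it assumes that the four FIPR-like operations of an FTRV issue back to back, one per cycle, each with the five-cycle FIPR latency. The special-hardware pitch of 13 cycles is the single-precision FDIV/FSQRT value quoted in the discussion of Figure 3.32.

```python
def chained_issue_timing(n_ops, op_pitch, op_latency):
    """Pitch and latency of an instruction made of n_ops sub-operations.

    Assumption: the sub-operations issue back to back, one every
    op_pitch cycles, and the instruction completes when the last
    sub-operation completes.
    """
    pitch = n_ops * op_pitch
    latency = (n_ops - 1) * op_pitch + op_latency
    return pitch, latency


def out_of_order_latency(special_pitch, extra_cycles=4):
    """Latency of FDIV/FSQRT: special-hardware pitch plus four cycles."""
    return special_pitch + extra_cycles


# FTRV modeled as four FIPR-like operations (one-cycle pitch, five-cycle latency).
ftrv_pitch, ftrv_latency = chained_issue_timing(n_ops=4, op_pitch=1, op_latency=5)

# Single-precision FDIV or FSQRT with a 13-cycle special-hardware pitch.
fdiv_latency = out_of_order_latency(special_pitch=13)

print(ftrv_pitch, ftrv_latency, fdiv_latency)
```

Under these assumptions the model reproduces the four-cycle pitch and eight-cycle latency of the FTRV and the 17-cycle latency of the single-precision FDIV and FSQRT. The four-cycle rule is applied only to FDIV and FSQRT; the FSRRA and FSCA figures in the text do not follow it.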

Multiply-accumulate (MAC) is one of the most frequent operations in compute-intensive applications. Four-way SIMD would achieve the same throughput as the FIPR, but its latency would be longer, and the register file would have to be larger. Figure 3.31 illustrates the difference, using the pitches and latencies of the FE-category SH-X instructions shown in Table 3.8. In this example, each box represents an operation issue slot. Since FMUL and FMAC have five-cycle latencies, 20 independent operations must be issued to reach peak throughput with four-way SIMD, and the result is available 20 cycles after the FMUL issue. In contrast, five independent operations are enough to reach peak throughput in a program using FIPRs. Therefore, FIPR requires one-quarter of the program parallelism and one-quarter of the latency.
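The one-quarter figure follows from simple arithmetic, sketched below in Python. This is a rough model, not a description of the hardware: the function names are hypothetical, and it assumes that the four-way SIMD version computes a four-element inner product as a dependent chain of one FMUL and three FMACs, each with a five-cycle latency, while a single FIPR produces the same inner product in one five-cycle operation.

```python
def ops_in_flight(latency, pitch=1, lanes=1):
    """Independent operations needed to keep the pipeline full.

    A new instruction issues every `pitch` cycles and each instruction
    carries `lanes` operations, so lanes * latency / pitch operations
    are in flight at peak throughput.
    """
    return lanes * latency // pitch


def dependent_chain_latency(n_steps, op_latency):
    """Latency of n_steps operations that each wait for the previous one."""
    return n_steps * op_latency


LATENCY = 5  # FMUL, FMAC, and FIPR latency on the SH-X, in cycles

# Four-way SIMD: four lanes per instruction, FMUL + 3 FMACs per inner product.
simd_parallelism = ops_in_flight(LATENCY, lanes=4)
simd_latency = dependent_chain_latency(4, LATENCY)

# FIPR: one inner product per instruction, no dependent chain.
fipr_parallelism = ops_in_flight(LATENCY, lanes=1)
fipr_latency = dependent_chain_latency(1, LATENCY)

print(simd_parallelism, simd_latency, fipr_parallelism, fipr_latency)
print(fipr_parallelism / simd_parallelism, fipr_latency / simd_latency)
```

With these assumptions the SIMD case needs 20 operations in flight and 20 cycles for a result, against 5 and 5 for the FIPR, which gives the one-quarter ratios stated above.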

Figure 3.32 compares the pitch and latency of an FSRRA with those of the equivalent sequence of an FSQRT and an FDIV, according to Table 3.8. The FSQRT and the FDIV each occupy the MAIN FPU for 2 cycles and the special resources for 13 cycles, and each takes 17 cycles to produce its result, so the final result is available 34 cycles after the issue of the FSQRT. In contrast, the pitch and latency of the FSRRA are one and
