Processor Cores - Heterogeneous Multicore Processor Technologies for Embedded Systems


The SH-X3, the third-generation core, supported multicore features for both SMP
and AMP [29, 30]. It was developed using a 90-nm generic process and achieved
600 MHz and 1,080 MIPS at 360 mW, resulting in 3,000 MIPS/W and 3.2 GIPS²/W.
The first prototype chip of the SH-X3 was the RP-1, which integrated four SH-X3
cores [31-34], and the second was the RP-2, which integrated eight SH-X3 cores
[35-37]. The core was then ported to a 65-nm low-power process and used for
product chips [38]. The design is discussed in Sect. 3.1.7.

The SH-X4, the latest, fourth-generation core of the SH-4A processor core
series, achieved 648 MHz and 1,717 MIPS at 106 mW, resulting in 16,240 MIPS/W
and 28 GIPS²/W using a 45-nm process [39-41]. The design is discussed in
Sect. 3.1.8.
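
The two efficiency figures quoted for each core follow from simple arithmetic on the performance and power numbers. The sketch below recomputes them, assuming that MIPS/W is MIPS divided by power in watts and that GIPS²/W is the square of the performance in GIPS divided by power in watts. The function names and the `CORES` table are illustrative and are not taken from the text.

```python
def mips_per_watt(mips: float, power_mw: float) -> float:
    """Power efficiency in MIPS/W, given MIPS and power in milliwatts."""
    return mips / (power_mw / 1000.0)


def gips2_per_watt(mips: float, power_mw: float) -> float:
    """Power-performance metric in GIPS^2/W (GIPS squared over watts)."""
    gips = mips / 1000.0
    return gips * gips / (power_mw / 1000.0)


# Figures quoted in the text: (performance in MIPS, power in mW).
CORES = {"SH-X3": (1080.0, 360.0), "SH-X4": (1717.0, 106.0)}

for name, (mips, power_mw) in CORES.items():
    print(f"{name}: {mips_per_watt(mips, power_mw):,.0f} MIPS/W, "
          f"{gips2_per_watt(mips, power_mw):.1f} GIPS^2/W")
```

With the SH-X3 inputs this reproduces the quoted 3,000 MIPS/W and 3.2 GIPS²/W. With the SH-X4 inputs it lands within about one percent of the quoted 16,240 MIPS/W and rounds to the quoted 28 GIPS²/W; the small difference is presumably due to rounding in the published MIPS and power values.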

3.1.2 Efficient Parallelization of SH-4

The SH-4 enhanced its performance and efficiency mainly through a superscalar
architecture, which suits multimedia processing with high parallelism and makes
an embedded processor suitable for digital appliances. However, conventional
superscalar processors gave first priority to performance and did not consider
efficiency seriously, because they were high-end processors for PCs and servers
[42-46]. Therefore, a highly efficient superscalar architecture was developed
and adopted for the SH-4. The design target was to apply the superscalar
architecture to an embedded processor while maintaining its efficiency, which
was already much higher than that of a high-end processor.

A high-end general-purpose processor was designed to enhance general
performance for PC/server use. However, it was subject to no serious
restrictions, which resulted in low efficiency. A program with low parallelism
cannot exploit the parallelism of a highly parallel superscalar processor, so
the efficiency of the processor degrades. Therefore, the target parallelism of
the superscalar architecture was set for programs with relatively low
parallelism, and the performance of multimedia processing was enhanced in
another way (see Sect. 3.1.5).
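
The argument about low-parallelism programs can be illustrated with a toy model. This is a minimal sketch, assuming that a program sustains at most its own instruction-level parallelism (ILP) per cycle no matter how wide the processor is; the ILP value of 1.5 and the function names are hypothetical and chosen only for illustration.

```python
def sustained_issue(program_ilp: float, issue_width: int) -> float:
    """Instructions issued per cycle: capped by both the program and the hardware."""
    return min(program_ilp, float(issue_width))


def utilization(program_ilp: float, issue_width: int) -> float:
    """Fraction of the available issue slots that the program actually fills."""
    return sustained_issue(program_ilp, issue_width) / issue_width


# Hypothetical program with relatively low parallelism.
PROGRAM_ILP = 1.5

for width in (1, 2, 4, 8):
    print(f"issue width {width}: "
          f"{sustained_issue(PROGRAM_ILP, width):.2f} instr/cycle, "
          f"{utilization(PROGRAM_ILP, width):.1%} of issue slots used")
```

Under these assumptions, widening the issue logic beyond the program's own parallelism adds hardware without adding sustained throughput, which is the efficiency loss the paragraph describes.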

The superscalar architecture enhances peak performance by issuing multiple
instructions simultaneously. However, the effective performance of real
applications diverges from the peak performance as the number of instructions
issued per cycle increases. The gap between peak and effective performance is
caused by hazards that introduce waiting cycles. A branch operation is the main
cause of waiting cycles for a fetched instruction, so it is important to speed
up branches efficiently. A resource conflict, which causes waiting cycles until
a resource becomes available, can be reduced by adding resources. However, the
efficiency will decrease if the performance enhancement does not compensate for
the hardware cost of the additional resources. Therefore, balanced resource
addition is necessary to maintain the efficiency. A register conflict, which
causes waiting cycles until a register value becomes available, can be reduced
by shortening instruction execution time and by forwarding data from a
data-defining instruction to a data-using one at the appropriate timing.
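
The widening gap between peak and effective performance can be sketched with a simple cycles-per-instruction model. This is only an illustration under stated assumptions: the ideal cost of an instruction is taken as 1/issue-width cycles, and the waiting cycles per instruction from each hazard type are treated as fixed values that do not depend on the issue width. The `StallProfile` class and all stall values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StallProfile:
    """Hypothetical average waiting cycles per instruction, by hazard type."""
    branch: float
    resource_conflict: float
    register_conflict: float

    def total(self) -> float:
        return self.branch + self.resource_conflict + self.register_conflict


def effective_ipc(issue_width: int, stalls: StallProfile) -> float:
    """Instructions per cycle once waiting cycles are added to the ideal CPI."""
    cycles_per_instruction = 1.0 / issue_width + stalls.total()
    return 1.0 / cycles_per_instruction


# Illustrative stall values only; they are not measurements of any SH core.
stalls = StallProfile(branch=0.15, resource_conflict=0.05, register_conflict=0.10)

for width in (1, 2, 4):
    eff = effective_ipc(width, stalls)
    print(f"issue width {width}: peak {width:.2f} IPC, "
          f"effective {eff:.2f} IPC, gap {width - eff:.2f}")
```

In this model the same waiting cycles cost more lost issue slots on a wider machine, so the gap grows with issue width. Reducing any of the three stall terms, for example through faster branches or data forwarding, raises the effective figure without changing the peak.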
