The SH-X3, the third-generation core, supported multicore features for both SMP
and AMP [29, 30]. It was developed using a 90-nm generic process and achieved
600 MHz and 1,080 MIPS at 360 mW, resulting in 3,000 MIPS/W and 3.2 GIPS²/W.
The first prototype chip of the SH-X3 was the RP-1, which integrated four SH-X3
cores [31-34], and the second was the RP-2, which integrated eight SH-X3 cores
[35-37]. The core was then ported to a 65-nm low-power process and used for
product chips [38]. The design is discussed in Sect. 3.1.7.
The SH-X4, the latest, fourth-generation core of the SH-4A processor core series,
achieved 648 MHz and 1,717 MIPS at 106 mW, resulting in 16,240 MIPS/W and
28 GIPS²/W using a 45-nm process [39-41]. The design is discussed in Sect. 3.1.8.
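The two efficiency figures quoted for each core follow directly from its MIPS rating and power. The sketch below recomputes them, assuming MIPS/W is performance divided by power in watts and GIPS²/W is the square of performance in GIPS divided by power in watts; the function name `efficiency_metrics` is illustrative and not from the source. With the rounded inputs given above, the SH-X3 figures are reproduced exactly, while the SH-X4 figure comes out near 16,200 MIPS/W rather than 16,240, presumably because the published value was derived from unrounded measurements.

```python
def efficiency_metrics(mips: float, power_mw: float) -> tuple[float, float]:
    """Return (MIPS/W, GIPS^2/W) from a MIPS rating and power in milliwatts."""
    watts = power_mw / 1000.0   # mW -> W
    gips = mips / 1000.0        # MIPS -> GIPS
    return mips / watts, gips * gips / watts


# Figures quoted in the text for each core.
sh_x3 = efficiency_metrics(mips=1080, power_mw=360)
sh_x4 = efficiency_metrics(mips=1717, power_mw=106)

for name, (mips_per_w, gips2_per_w) in (("SH-X3", sh_x3), ("SH-X4", sh_x4)):
    print(f"{name}: {mips_per_w:,.0f} MIPS/W, {gips2_per_w:.1f} GIPS^2/W")
```

GIPS²/W weights performance more heavily than MIPS/W does, so it rewards a core that gains speed without a proportional increase in power.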
3.1.2 Efficient Parallelization of SH-4
The SH-4 enhanced its performance and efficiency mainly through a superscalar
architecture, which suits multimedia processing with high parallelism and makes
an embedded processor suitable for digital appliances. However, conventional
superscalar processors gave first priority to performance and did not consider
efficiency seriously, because they were high-end processors for PCs and servers
[42-46]. Therefore, a highly efficient superscalar architecture was developed and
adopted for the SH-4. The design target was to apply the superscalar architecture
to an embedded processor while maintaining its efficiency, which was already
high and much higher than that of a high-end processor.
A high-end general-purpose processor was designed to enhance general
performance for PC/server use. However, the absence of serious design
restrictions led to low efficiency. A program with low parallelism cannot exploit
the parallelism of a highly parallel superscalar processor, and the efficiency of
the processor degrades. Therefore, the target parallelism of the superscalar
architecture was set for programs with relatively low parallelism, and the
performance of multimedia processing was enhanced in another way (see
Sect. 3.1.5).
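The argument that a wide superscalar processor is wasted on a low-parallelism program can be illustrated with a toy model. The sketch below is not taken from the source: it assumes the sustained issue rate is simply the smaller of the issue width and the program's available instruction-level parallelism, and the name `issue_utilization` and the value of 1.5 independent instructions per cycle are hypothetical.

```python
def issue_utilization(issue_width: int, program_ilp: float) -> float:
    """Fraction of issue slots actually used per cycle (toy model).

    program_ilp is the average number of independent instructions the
    program can supply per cycle; it is an assumed input, not a measurement.
    """
    sustained = min(float(issue_width), program_ilp)
    return sustained / issue_width


PROGRAM_ILP = 1.5  # hypothetical low-parallelism program

for width in (1, 2, 4, 8):
    used = issue_utilization(width, PROGRAM_ILP)
    print(f"issue width {width}: {used:.1%} of issue slots used")
```

Under these assumptions, widening the issue beyond two adds hardware without adding sustained throughput, which is consistent with choosing a target parallelism matched to programs with relatively low parallelism.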
The superscalar architecture enhances peak performance by issuing multiple
instructions simultaneously. However, the effective performance of a real
application diverges from the peak performance as the number of instructions
issued per cycle increases. The gap between peak and effective performance is
caused by hazards that introduce waiting cycles. A branch operation is the main
cause of waiting cycles for a fetched instruction, so it is important to speed up
branches efficiently. A resource conflict, which causes waiting cycles until a
resource becomes available, can be reduced by adding resources. However,
efficiency will decrease if the performance gain does not compensate for the
hardware cost of the additional resource. Therefore, balanced resource addition
is necessary to maintain efficiency. A register conflict, which causes waiting
cycles until a register value becomes available, can be reduced by shortening
instruction execution time and by forwarding data from a data-defining
instruction to a data-using one at the appropriate time.
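The widening gap between peak and effective performance can be sketched with a simple cycles-per-instruction model. This is an illustrative assumption rather than the authors' analysis: the ideal CPI is taken as 1 divided by the issue width, and a fixed number of stall cycles per instruction is added for branch, resource, and register hazards. The stall values in `STALLS` and the function names are hypothetical.

```python
def effective_ipc(issue_width: int, stall_cycles_per_instr: float) -> float:
    """Instructions per cycle once hazard stalls are added to the ideal CPI."""
    ideal_cpi = 1.0 / issue_width
    return 1.0 / (ideal_cpi + stall_cycles_per_instr)


def fraction_of_peak(issue_width: int, stall_cycles_per_instr: float) -> float:
    """Effective IPC divided by peak IPC (peak IPC equals the issue width)."""
    return effective_ipc(issue_width, stall_cycles_per_instr) / issue_width


# Hypothetical average stall cycles per instruction for each hazard type.
STALLS = {"branch": 0.10, "resource": 0.05, "register": 0.10}
total_stalls = sum(STALLS.values())

for width in (1, 2, 4):
    ipc = effective_ipc(width, total_stalls)
    share = fraction_of_peak(width, total_stalls)
    print(f"issue width {width}: effective IPC {ipc:.2f}, {share:.0%} of peak")
```

With these assumed stall rates, the model reaches about 80% of peak for single issue but only about 50% for four-way issue, because the same waiting cycles cost more issue slots as the width grows. Lowering any entry in `STALLS`, for example the register entry to represent data forwarding, narrows the gap.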