Processor Cores - Heterogeneous Multicore Processor Technologies for Embedded Systems - page 33

[Figure: pipeline stages I1, I2, ID, and E1 to E7 across the BR, INT, LS, and FE pipelines. Labeled blocks: out-of-order branch, instruction fetch, instruction decoding, FPU instruction decoding, address, tag, data load, data store, store buffer, execution, FPU data transfer, FPU arithmetic execution, WB, and flexible forwarding.]

Fig. 3.7 Seven-stage superpipeline structure of SH-X

frequency can be 1.4 times as high as that of the SH-4. The shortfall from 1.5 times is caused by the additional pipeline latches required for the extra stage.

Control signals and processing data flow backward as well as forward through the pipeline. The backward flows convey various information and execution results of the preceding instructions, which are used to control and execute the following instructions. This information includes whether preceding instructions have been issued or are still occupying resources, where the latest value of a source operand is flowing in the pipeline, and so on. It is used to issue an instruction every cycle, so the latest information must be collected within a single cycle. Gathering and handling this information becomes difficult as the cycle time shortens in a superpipeline architecture, and the issue control logic tends to become complicated and large. However, the quantity of hardware is determined mainly by the major microarchitecture, and the hardware increase was expected to be less than 1.4 times.
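One piece of the backward-flowing information described above, namely where the latest value of a source operand is flowing in the pipeline, can be illustrated with a minimal sketch. The record layout, the function name `latest_operand_source`, and the sample register assignments are hypothetical and are not taken from the SH-X design; the sketch only shows the idea that the youngest preceding producer of a register determines where its value should be taken from.

```python
from typing import Optional

# Hypothetical in-flight record: (stage name, destination register), listed
# from the youngest preceding instruction to the oldest.
InFlight = list[tuple[str, str]]


def latest_operand_source(in_flight: InFlight, src_reg: str) -> Optional[str]:
    """Return the stage holding the newest in-flight value of src_reg,
    or None when no preceding in-flight instruction writes it."""
    for stage, dest_reg in in_flight:
        if dest_reg == src_reg:
            return stage  # the youngest matching producer wins
    return None


# Illustrative pipeline contents (assumed, not from the source text).
pipeline = [("E1", "R2"), ("E2", "R0"), ("E3", "R2")]
print(latest_operand_source(pipeline, "R2"))  # E1
print(latest_operand_source(pipeline, "R5"))  # None
```

In hardware this lookup would have to complete for every source operand within one cycle, which is why the text notes that the issue control logic grows as the cycle time shrinks.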

A conventional seven-stage pipeline had 20% lower cycle performance than a five-stage one. This means the performance gain of the superpipeline architecture was only 1.4 × 0.8 = 1.12 times, which would not compensate for the hardware increase. The branch penalty increased because of the additional instruction fetch cycles of the I1 and I2 stages, and the load-use conflict penalty increased because of the additional data load cycles of the E1 and E2 stages. These were the main reasons for the 20% degradation.
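The arithmetic above can be written as a short sketch: the net gain is the frequency ratio multiplied by the relative cycle performance. The function name `net_speedup` is illustrative, and the model deliberately ignores everything except the two factors quoted in the text.

```python
def net_speedup(frequency_ratio: float, cycle_performance_ratio: float) -> float:
    """Net performance gain = frequency gain x relative per-cycle performance."""
    return frequency_ratio * cycle_performance_ratio


# Values from the text: 1.4 times the frequency, 20% lower cycle performance.
gain = net_speedup(1.4, 0.8)
print(f"{gain:.2f}")  # 1.12
```

The same function shows why recovering cycle performance matters: if the 20% loss were fully avoided, `net_speedup(1.4, 1.0)` would return the full frequency gain of 1.4.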

Figure 3.7 illustrates the seven-stage superpipeline structure of the SH-X with delayed execution, a store buffer, an out-of-order branch, and flexible forwarding. Compared to the conventional pipeline shown in Fig. 3.6, the INT pipeline starts its execution one cycle later, at the E2 stage; store data is buffered in the store buffer at the E4 stage and written to the data cache at the E5 stage; and the data transfer of the FPU supports flexible forwarding. The BR pipeline starts at the ID stage but is not synchronized with the other pipelines, which allows out-of-order branch issue.

The delayed execution is effective in reducing the load-use conflict, as Fig. 3.8 illustrates. It also relaxes the decoding time by lengthening the decoding to two stages, except for the address calculation. With the conventional architecture shown in Fig. 3.6, a load instruction, MOV.L, sets up an R0 value at the ID stage, calculates a load address at the E1 stage, loads data from the data cache at the E2 and E3
