Processor Cores - Heterogeneous Multicore Processor Technologies for Embedded Systems - page 36

[Figure: pipeline timing diagram. Rows: Compare, Branch, Delay Slot, Target, and Fall through (Prediction miss). Stages: I1, I2, IQ, ID, and E1 to E4. The prediction miss is marked with a 2-cycle stall.]

Fig. 3.11 Branch execution sequence of SH-X

direction that the branch is taken or not taken. However, this is not early enough to eliminate the empty issue slots. Therefore, the SH-X adopted out-of-order issue for branches that use no general-purpose register.

The SH-X fetches four instructions per cycle and issues at most two. Therefore, instructions are buffered in an instruction queue (IQ), as illustrated. A branch instruction is searched for in the IQ or in the instruction-cache output at the I2 stage and is provided to the ID stage of the branch pipeline for out-of-order issue, earlier than the other instructions, which are provided to the ID stage in order. The conditional branch instruction is then issued right after it is fetched, while the preceding instructions are still in the IQ, and the issue becomes early enough to eliminate the empty issue slots. As a result, the target instruction is fetched and then decoded at the ID stage right after the delay-slot instruction. This means that no branch penalty occurs in the sequence when the preceding or delay-slot instructions stay in the IQ for two or more cycles.
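The search for an early-issue branch can be pictured with a minimal sketch. Everything here is a hypothetical model rather than the SH-X logic itself: the `Instr` record, the `pick_early_branch` helper, and the rule that only the oldest branch in the window is considered are assumptions made for illustration, and the mnemonics are only labels.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction record; not an SH-X data structure."""
    mnemonic: str
    is_branch: bool = False
    uses_gpr: bool = False  # True if the branch reads a general-purpose register


def pick_early_branch(window: Sequence[Instr]) -> Optional[int]:
    """Return the position of the oldest branch if it may issue out of order.

    `window` models the IQ contents followed by the instruction-cache
    output seen at the I2 stage, oldest instruction first. Assumption:
    only the oldest branch is considered, and it qualifies only when it
    uses no general-purpose register.
    """
    for pos, instr in enumerate(window):
        if instr.is_branch:
            return None if instr.uses_gpr else pos
    return None


# Two older instructions are still waiting when the branch is found.
window = [
    Instr("add"),
    Instr("cmp/eq"),
    Instr("bt/s", is_branch=True),
    Instr("mov"),
]
early = pick_early_branch(window)

# Everything except the early branch continues to issue in order.
in_order = [i.mnemonic for pos, i in enumerate(window) if pos != early]
```

In this model the branch at position 2 goes to the branch pipeline while `add`, `cmp/eq`, and `mov` keep their program order. A branch that needs a general-purpose register would return `None` and stay in the in-order stream.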

The compare result is available at the E3 stage, where the prediction is checked for a hit or a miss. On a miss, the instruction of the correct flow is decoded at the ID stage right after the E3 stage, and a two-cycle stall occurs. If the correct flow is not held in the IQ, the misprediction recovery starts from the I1 stage and takes two more cycles.
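The stall arithmetic in this paragraph can be written out as a small sketch. It reads "two more cycles" as two cycles on top of the two-cycle stall, which is an interpretation of the text. The function names and the two rate parameters are hypothetical inputs, not figures from the source.

```python
STALL_CORRECT_FLOW_IN_IQ = 2  # cycles, from the text: decode resumes right after E3
EXTRA_REFETCH_CYCLES = 2      # cycles, from the text: recovery restarts at I1


def mispredict_stall(correct_flow_in_iq: bool) -> int:
    """Stall cycles for one mispredicted branch under the stated reading."""
    stall = STALL_CORRECT_FLOW_IN_IQ
    if not correct_flow_in_iq:
        stall += EXTRA_REFETCH_CYCLES
    return stall


def expected_stall_per_branch(miss_rate: float, iq_hit_rate: float) -> float:
    """Average stall cycles per branch.

    miss_rate:   fraction of branches that are mispredicted (assumed input).
    iq_hit_rate: fraction of misses whose correct flow is still in the IQ
                 (assumed input).
    """
    if not (0.0 <= miss_rate <= 1.0 and 0.0 <= iq_hit_rate <= 1.0):
        raise ValueError("rates must lie in [0, 1]")
    avg_miss_stall = (iq_hit_rate * mispredict_stall(True)
                      + (1.0 - iq_hit_rate) * mispredict_stall(False))
    return miss_rate * avg_miss_stall
```

For instance, with a made-up 10% miss rate and the correct flow found in the IQ half the time, the average works out to about 0.3 stall cycles per branch.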

Historically, dynamic branch prediction started from a BHT with a 1-bit history per entry, which recorded whether the branch was taken or not the last time and predicted the same direction. Then a BHT with a 2-bit history per entry became popular; its four states (strongly taken, weakly taken, weakly not taken, and strongly not taken) let the prediction reflect the outcomes of the last several executions. There were several types of state transitions, including a simple up-down transition. Since each entry held only one or two bits, it was too expensive to attach a tag consisting of part of the branch-instruction address, which was usually about 20 bits for 32-bit addressing. Therefore, we could increase
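The 2-bit scheme with a simple up-down transition can be sketched as an untagged table of saturating counters. This is a generic illustration, not the SH-X predictor: the class name, the table size, the initial state, and the index function (which drops address bit 0 on the assumption of 16-bit instructions) are all assumptions.

```python
class TwoBitBHT:
    """Untagged branch history table of 2-bit up-down saturating counters.

    Counter values: 0 = strongly not taken, 1 = weakly not taken,
                    2 = weakly taken,       3 = strongly taken.
    """

    def __init__(self, index_bits: int = 10, initial: int = 1):
        if not 0 <= initial <= 3:
            raise ValueError("initial state must be in 0..3")
        self.mask = (1 << index_bits) - 1
        self.table = [initial] * (1 << index_bits)

    def _index(self, pc: int) -> int:
        # Assumed index function: low address bits above bit 0.
        # No tag is stored, so different branches can share an entry.
        return (pc >> 1) & self.mask

    def predict(self, pc: int) -> bool:
        """Predict taken when the counter is in one of the two taken states."""
        return self.table[self._index(pc)] >= 2

    def update(self, pc: int, taken: bool) -> None:
        """Move one step toward the observed direction, saturating at 0 and 3."""
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)


bht = TwoBitBHT(index_bits=4)
pc = 0x100
for outcome in (True, True, False):  # taken, taken, then one not-taken
    bht.update(pc, outcome)
still_taken = bht.predict(pc)        # a single not-taken does not flip the prediction
```

Because the entry carries no tag, a second branch at `0x120` maps to the same index in this 16-entry example and would share, and disturb, the same counter.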
