Hardware Reference
In-Depth Information
[Figure: pipeline timing diagram. Instruction rows: Compare, Branch, Delay Slot, Target, and Fall through (Prediction miss). Pipeline stages: I1, I2, IQ, ID, E1 E2 E3 E4. Annotation: 2-cycle stall.]
Fig. 3.11 Branch execution sequence of SH-X
direction that the branch is taken or not taken. However, this is not early enough to
eliminate the empty issue slots. Therefore, the SH-X adopted out-of-order issue
for branches that use no general-purpose register.
The SH-X fetches four instructions per cycle and issues at most two instructions per cycle.
Therefore, instructions are buffered in an instruction queue (IQ) as illustrated. A branch
instruction is searched for in the IQ or in the instruction-cache output at the I2 stage and
is provided to the ID stage of the branch pipeline for out-of-order issue, ahead of
the other instructions, which are provided to the ID stage in order. The conditional branch
instruction is thus issued right after it is fetched, while the preceding instructions are still in the
IQ, and the issue becomes early enough to eliminate the empty issue slots. As a result,
the target instruction is fetched and decoded at the ID stage right after the delay-slot
instruction. This means that no branch penalty occurs in the sequence when the preceding
or delay-slot instructions stay two or more cycles in the IQ.
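The branch search described above can be pictured with a small software sketch. This is only an illustration under stated assumptions, not the SH-X logic itself: the `Instr` record, its `is_branch` and `uses_gpr` fields, the function `pick_early_branch`, and the instruction names in the usage lines are all hypothetical, and the real hardware performs this selection in parallel at the I2 stage rather than by scanning a list.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Instr:
    # Hypothetical instruction record used only for this sketch.
    name: str
    is_branch: bool = False
    uses_gpr: bool = False  # True if the branch needs a general-purpose register


def pick_early_branch(iq, fetched):
    """Return the oldest branch that may be issued out of order, or None.

    Candidates are searched in program order: first the instructions already
    waiting in the instruction queue (IQ), then the instruction-cache output.
    Only branches that use no general-purpose register are eligible.
    """
    for instr in list(iq) + list(fetched):
        if instr.is_branch and not instr.uses_gpr:
            return instr
    return None


# Usage: two older instructions wait in the IQ while a conditional branch
# arrives in the newly fetched group.
iq = deque([Instr("add"), Instr("cmp")])
fetched = [Instr("bt", is_branch=True), Instr("nop")]
early = pick_early_branch(iq, fetched)
print(early.name)  # bt
```

In this sketch the selected branch would go to the branch pipeline while `add` and `cmp` continue to wait in the IQ for in-order issue, which mirrors the situation in which the text says no branch penalty occurs.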
The compare result is available at the E3 stage, and the prediction is then checked for a
hit or a miss. On a miss, the instruction of the correct flow is decoded at the ID stage
right after the E3 stage, and a two-cycle stall occurs. If the correct flow is not held in the
IQ, the misprediction recovery starts from the I1 stage and takes two more cycles.
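The stall counts in this paragraph can be summarized as a small function. This is a minimal sketch of the arithmetic stated in the text, not a cycle-accurate model; the function name and its parameters are assumptions introduced for illustration.

```python
def misprediction_stall_cycles(predicted_taken, actually_taken, correct_flow_in_iq):
    """Stall cycles charged when the prediction is checked at the E3 stage.

    - Prediction hit: no stall.
    - Prediction miss, correct flow already held in the IQ: 2 cycles.
    - Prediction miss, correct flow not in the IQ: recovery restarts from
      the I1 stage and takes 2 more cycles, 4 in total.
    """
    if predicted_taken == actually_taken:
        return 0
    stall = 2
    if not correct_flow_in_iq:
        stall += 2
    return stall


print(misprediction_stall_cycles(True, True, True))    # 0
print(misprediction_stall_cycles(True, False, True))   # 2
print(misprediction_stall_cycles(True, False, False))  # 4
```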
Historically, dynamic branch prediction started from a BHT with
1-bit history per entry, which recorded whether the branch was taken or not taken the last
time and predicted the same direction. Then a BHT with 2-bit history per
entry became popular, and the four direction states of strongly taken, weakly taken,
weakly not taken, and strongly not taken were used for the prediction to reflect the
history of several executions. There were several types of state transitions, including
a simple up-down transition. Since each entry held only one or two bits, it was too
expensive to attach a tag consisting of a part of the branch-instruction address,
which was usually about 20 bits for 32-bit addressing. Therefore, we could increase
 