Seven-stage superpipeline structure of SH-X
frequency can be 1.4 times as high as that of the SH-4. The shortfall from 1.5 times
is caused by the additional pipeline latches for the extra stage.
Control signals and processing data flow backward as well as forward through the
pipeline. The backward flows convey various information and execution results of
the preceding instructions to control and execute the following instructions. This
information includes whether preceding instructions have been issued or are still
occupying resources, where the latest value of a source operand is in the pipeline,
and so on. Such information is used for instruction issue every cycle, so the latest
information must be collected within a cycle. Gathering and handling this
information becomes difficult as the cycle time shortens in a superpipeline
architecture, and the issue control logic tends to become complicated and large.
However, the quantity of hardware is determined mainly by the major
microarchitecture, and the hardware increase was expected to be less than 1.4 times.
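
To make the role of the backward-flowing information concrete, the following is a
minimal Python sketch of an issue decision. It is not the SH-X issue logic: the
names InFlight and issue_decision, the fields, and the stall messages are all
hypothetical and chosen only for illustration. The sketch assumes that each
preceding instruction reports its destination register, whether its result can
already be forwarded, and which resource (if any) it still occupies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InFlight:
    """Hypothetical record of a preceding instruction still in the pipeline."""
    dest: Optional[str]            # destination register, or None if it writes none
    result_ready: bool             # True if its result can already be forwarded
    holds_resource: Optional[str]  # resource it still occupies, or None


def issue_decision(sources: List[str], resource: str,
                   in_flight: List[InFlight]) -> str:
    """Decide whether the next instruction can issue this cycle.

    in_flight is ordered from oldest to youngest preceding instruction.
    """
    # A preceding instruction still occupying the needed resource blocks issue.
    if any(op.holds_resource == resource for op in in_flight):
        return "stall: resource busy"

    for src in sources:
        # The youngest producer of a register holds its latest value.
        producer = next(
            (op for op in reversed(in_flight) if op.dest == src), None
        )
        if producer is not None and not producer.result_ready:
            return f"stall: {src} not ready"

    return "issue"


# Two preceding instructions write R1; only the older result is available.
pipeline = [
    InFlight(dest="R1", result_ready=True, holds_resource=None),
    InFlight(dest="R1", result_ready=False, holds_resource=None),
]
print(issue_decision(["R1"], "alu", pipeline))  # stall: R1 not ready
print(issue_decision(["R2"], "alu", pipeline))  # issue
```

Even in this toy model, every check depends on state from all preceding
instructions, which suggests why collecting the latest information within one short
cycle becomes harder as the pipeline deepens.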
A conventional seven-stage pipeline had 20% lower cycle performance than a five-stage
one. This means the performance gain of the superpipeline architecture was only
1.4 × 0.8 = 1.12 times, which would not compensate for the hardware increase. The
branch penalty increased because of the additional instruction fetch cycles of the
I1 and I2 stages. The load-use conflict penalty increased because of the additional
data load cycles of the E1 and E2 stages. These were the main reasons for the 20%
degradation.
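
The arithmetic above can be checked with a short Python sketch. It assumes, as the
text does, that net performance is the product of the frequency ratio and the
cycle-performance ratio. The function names are hypothetical.

```python
def net_gain(frequency_ratio: float, cycle_performance_ratio: float) -> float:
    """Net performance ratio = frequency ratio x cycle-performance ratio."""
    return frequency_ratio * cycle_performance_ratio


def required_cycle_ratio(target_gain: float, frequency_ratio: float) -> float:
    """Cycle-performance ratio needed to reach a target net gain."""
    return target_gain / frequency_ratio


# Conventional seven-stage pipeline: 1.4x frequency, 20% lower cycle performance.
print(f"{net_gain(1.4, 0.8):.2f}")  # 1.12

# Cycle-performance ratio needed for the net gain to equal the 1.4x frequency gain.
print(f"{required_cycle_ratio(1.4, 1.4):.2f}")  # 1.00
```

The second call shows that the full 1.4 times gain is reached only if cycle
performance does not degrade at all relative to the five-stage pipeline.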
Figure 3.7 illustrates the seven-stage superpipeline structure of the SH-X with
delayed execution, a store buffer, out-of-order branch, and flexible forwarding.
Compared to the conventional pipeline shown in Fig. 3.6, the INT pipeline starts its
execution one cycle later at the E2 stage, store data is buffered in the store buffer
at the E4 stage and stored to the data cache at the E5 stage, and the data transfer of
the FPU supports flexible forwarding. The BR pipeline starts at the ID stage, but is
not synchronized with the other pipelines, to allow out-of-order branch issue.
Delayed execution is effective in reducing load-use conflicts, as Fig. 3.8
illustrates. It also extends decoding to two stages, except for the address
calculation, and thus relaxes the decoding time. With the conventional architecture
shown in Fig. 3.6, a load instruction, MOV.L, sets up an R0 value at the ID stage,
calculates a load address at the E1 stage, loads data from the data cache at the E2 and E3