[Figure 3.12: Conventional clock-gating method. Legend: F/Fs with CCP; ph1 edge-triggered F/F; ph2 transparent latch; CCP: Control Clock Pin; GCKD: Gated Clock Driver Cell.]
the number of entries about 10 or 20 times without the tag. Without the tag,
different branch instructions could not be distinguished and false hits occurred,
but the benefit of the additional entries outweighed the cost of the false hits.
A global history method was also popular for branch prediction and was usually used
with a 2-bit/entry BHT.
The SH-X stalled for only two cycles on a branch misprediction, so its performance
was not very sensitive to the hit ratio. Further, the one-bit method required a state
change only on a misprediction, and the change could be made during the stall.
Therefore, the SH-X adopted a dynamic branch prediction method with a 4K-entry,
1-bit/entry BHT and a global history. The BHT was much smaller than the instruction
and data caches, which were 32 KB each.
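The scheme described above can be sketched in a few lines of Python. This is only an illustrative model, not the SH-X implementation: the text gives the table size (4K entries, 1 bit each) and says a global history is used, but the index function (program counter XOR history), the 12-bit history length, and the names `OneBitGlobalPredictor`, `run`, and `bht_bytes` are assumptions made for this sketch.

```python
class OneBitGlobalPredictor:
    """Tagless 1-bit/entry BHT combined with a global history register."""

    def __init__(self, entries=4096, history_bits=12):
        assert entries & (entries - 1) == 0, "entries must be a power of two"
        self.mask = entries - 1
        self.history_mask = (1 << history_bits) - 1
        self.table = [0] * entries  # one bit per entry: 1 means "predict taken"
        self.history = 0            # most recent branch outcomes, newest in bit 0
        self.writes = 0             # number of BHT state changes

    def _index(self, pc):
        # Assumption: drop the low address bit and XOR with the global history.
        # There is no tag, so different branches may share an entry (false hit).
        return ((pc >> 1) ^ self.history) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] == 1

    def update(self, pc, taken):
        idx = self._index(pc)
        # With one bit per entry, the stored bit differs from the outcome
        # exactly when the prediction was wrong, so a write is needed only
        # on a misprediction.
        if self.table[idx] != int(taken):
            self.table[idx] = int(taken)
            self.writes += 1
        self.history = ((self.history << 1) | int(taken)) & self.history_mask


def run(predictor, pc, outcomes):
    """Feed a sequence of outcomes for one branch; return the misprediction count."""
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


def bht_bytes(entries, bits_per_entry):
    """Storage for the prediction bits alone, in bytes."""
    return entries * bits_per_entry // 8


if __name__ == "__main__":
    p = OneBitGlobalPredictor()
    loop_branch = ([True] * 7 + [False]) * 40  # taken 7 times, then falls through
    misses = run(p, 0x1000, loop_branch)
    print("mispredictions:", misses, "BHT writes:", p.writes)
    print("BHT size in bytes:", bht_bytes(4096, 1))
```

In this model the number of table writes always equals the number of mispredictions, which is the property that lets the update be hidden in the stall. The storage arithmetic also shows why the table is small: 4K entries of 1 bit occupy 512 bytes, which is 1/64 of the data array of one 32 KB cache.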
Low-Power Technologies of SH-X
The SH-X achieved excellent power efficiency by using various low-power technologies.
Two of them, hierarchical clock gating and a pointer-controlled pipeline, are
explained in this section.
Figure 3.12 illustrates a conventional clock-gating method. In this example, the
clock tree has four levels with A-, B-, C-, and D-drivers. The A-driver receives the
clock from the clock generator and distributes it to each module in the processor.
Then, the B-driver of each module receives the clock and distributes it to various
submodules, each containing 128 to 256 flip-flops (F/Fs). The B-driver gates the clock
with the signal from the clock control register, whose value is statically written by
software to stop and start the modules. Next, the C- and D-drivers distribute the clock
hierarchically to the leaf F/Fs, each of which has a Control Clock Pin (CCP). The leaf
F/Fs are gated by hardware through the CCP to avoid activating them unnecessarily.
However, the clock tree inside a module is always active while the module is enabled
by software.
Figure 3.13 illustrates the clock-gating method of the SH-X. In addition to the
clock gating at the B-driver, the C-drivers gate the clock with the signals dynamically
generated by hardware to reduce the clock tree activity. As a result, the clock power
is 30% less than that of the conventional method.
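The difference between the two methods can be illustrated with a toy activity count. This is a simplified model written for this explanation, not a description of the actual SH-X circuit: the `Module` class, the `toggling_drivers` function, and the fan-out of four D-drivers per C-driver are hypothetical, and the model does not attempt to reproduce the 30% figure.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Module:
    enabled_by_software: bool     # B-driver gate, set via the clock control register
    submodule_active: List[bool]  # per C-driver: does any leaf F/F need a clock this cycle?
    d_drivers_per_c: int = 4      # hypothetical fan-out, chosen only for illustration


def toggling_drivers(module: Module, gate_at_c: bool) -> int:
    """Count the C- and D-driver outputs that toggle in one cycle inside one module.

    gate_at_c=False models the conventional method: only the B-driver and the
    leaf F/Fs are gated, so the whole tree inside an enabled module toggles.
    gate_at_c=True models hierarchical gating: a C-driver whose subtree is idle
    is stopped by hardware, so it and the D-drivers below it stay quiet.
    """
    if not module.enabled_by_software:
        return 0  # the B-driver stops the clock for the whole module
    count = 0
    for active in module.submodule_active:
        if gate_at_c and not active:
            continue
        count += 1 + module.d_drivers_per_c  # the C-driver plus its D-drivers
    return count


if __name__ == "__main__":
    m = Module(enabled_by_software=True,
               submodule_active=[True, False, False, True])
    print("conventional:", toggling_drivers(m, gate_at_c=False))
    print("hierarchical:", toggling_drivers(m, gate_at_c=True))
```

In this model the saving depends entirely on how often submodules are idle while their module is enabled, so the dynamically generated gating signals pay off only for the idle fraction of each module's cycles.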
The superpipeline architecture improved the operating frequency but increased the
number of F/Fs and the power consumption. Therefore, one of the key design considerations was