Hardware Reference
In-Depth Information
FIGURE 3.39 The estimated composition of the CPI on the ARM A8 shows that pipeline
stalls are the primary addition to the base CPI. eon deserves some special mention, as it
does integer-based graphics calculations (ray tracing) and has very few cache misses. It is
computationally intensive with heavy use of multiplies, and the single multiply pipeline
becomes a major bottleneck. This estimate is obtained by using the L1 and L2 miss rates and
penalties to compute the L1- and L2-generated stalls per instruction. These are subtracted
from the CPI measured by a detailed simulator to obtain the pipeline stalls. Pipeline stalls
include all three hazards plus minor effects such as way misprediction.
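The subtraction described in the caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the parameter names, and every number in the usage example are hypothetical and are not measurements from the figure. The base CPI of 0.5 is assumed as the ideal for a processor that issues up to two instructions per clock.

```python
def cpi_breakdown(measured_cpi, base_cpi,
                  l1_misses_per_instr, l1_penalty,
                  l2_misses_per_instr, l2_penalty):
    """Split a measured CPI into base, L1 stall, L2 stall, and pipeline stall parts.

    Miss counts are per instruction and penalties are in clock cycles.
    Pipeline stalls are whatever remains after the base CPI and the
    cache-generated stalls are subtracted from the measured CPI.
    """
    l1_stalls = l1_misses_per_instr * l1_penalty
    l2_stalls = l2_misses_per_instr * l2_penalty
    pipeline_stalls = measured_cpi - base_cpi - l1_stalls - l2_stalls
    return {
        "base": base_cpi,
        "l1_stalls": l1_stalls,
        "l2_stalls": l2_stalls,
        "pipeline_stalls": pipeline_stalls,
    }


# Hypothetical inputs, chosen only to show the arithmetic.
example = cpi_breakdown(
    measured_cpi=2.0,          # assumed simulator result
    base_cpi=0.5,              # assumed ideal CPI for dual issue
    l1_misses_per_instr=0.02,  # assumed
    l1_penalty=11,             # assumed, in cycles
    l2_misses_per_instr=0.005, # assumed
    l2_penalty=60,             # assumed, in cycles
)
for part, cycles in example.items():
    print(f"{part}: {cycles:.2f}")
```

Because the pipeline component is computed as a remainder, any error in the miss rates or penalties shifts directly into it, which is why the caption calls the result an estimate.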
The insight that the pipeline stalls created significant performance losses probably played
a key role in the decision to make the ARM Cortex-A9 a dynamically scheduled superscalar.
The A9, like the A8, issues up to two instructions per clock, but it uses dynamic scheduling
and speculation. Up to four pending instructions (two ALUs, one load/store or FP/multimedia,
and one branch) can begin execution in a clock cycle. The A9 uses a more powerful branch
predictor, instruction cache prefetch, and a nonblocking L1 data cache. Figure 3.40 shows that
the A9 outperforms the A8 by a factor of 1.28 on average, assuming the same clock rate and
virtually identical cache configurations.
 