Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach

The ARM Cortex-A8

The A8 is a dual-issue, statically scheduled superscalar with dynamic issue detection, which allows the processor to issue one or two instructions per clock. Figure 3.36 shows the basic pipeline structure of the 13-stage pipeline.

The A8 uses a dynamic branch predictor with a 512-entry two-way set associative branch target buffer and a 4K-entry global history buffer, which is indexed by the branch history and the current PC. In the event that the branch target buffer misses, a prediction is obtained from the global history buffer, which can then be used to compute the branch address. In addition, an eight-entry return stack is kept to track return addresses. An incorrect prediction results in a 13-cycle penalty as the pipeline is flushed.
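
The predictor structures described above can be illustrated with a small model. This is a minimal sketch, not the A8's actual design: the text gives only the table sizes, so the two-bit counters, the XOR of PC and history, the 10-bit history length, and the overflow policy of the return stack are all assumptions, and every name here is hypothetical.

```python
class GlobalHistoryPredictor:
    """Toy direction predictor with a 4K-entry table (size from the text)."""

    ENTRIES = 4096

    def __init__(self, history_bits=10):
        # Assumption: history length is not given in the text.
        self.history_bits = history_bits
        self.history = 0
        # Assumption: two-bit saturating counters, starting weakly not-taken.
        self.counters = [1] * self.ENTRIES

    def _index(self, pc):
        # Assumption: gshare-style XOR of the word-aligned PC and the history.
        return ((pc >> 2) ^ self.history) % self.ENTRIES

    def predict(self, pc):
        """Return True if the branch at pc is predicted taken."""
        return self.counters[self._index(pc)] >= 2

    def update(self, pc, taken):
        """Train the counter for this (pc, history) pair, then shift history."""
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        mask = (1 << self.history_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask


class ReturnStack:
    """Eight-entry return address stack (depth from the text)."""

    DEPTH = 8

    def __init__(self):
        self.entries = []

    def push(self, return_address):
        if len(self.entries) == self.DEPTH:
            self.entries.pop(0)  # assumption: the oldest entry is dropped on overflow
        self.entries.append(return_address)

    def pop(self):
        """Return the predicted return address, or None if the stack is empty."""
        return self.entries.pop() if self.entries else None


def mispredict_stall_cpi(branch_fraction, mispredict_rate, penalty_cycles=13):
    """Stall cycles per instruction caused by mispredicted branches."""
    return branch_fraction * mispredict_rate * penalty_cycles
```

As a rough use of the last function, a hypothetical workload with 20% branches and a 10% misprediction rate would add about 0.26 stall cycles per instruction at the 13-cycle penalty.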

Figure 3.37 shows the instruction decode pipeline. Up to two instructions per clock can be issued using an in-order issue mechanism. A simple scoreboard structure is used to track when an instruction can issue. A pair of dependent instructions can be processed through the issue logic, but, of course, they will be serialized at the scoreboard, unless they can be issued so that the forwarding paths can resolve the dependence.
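
The serialization of dependent pairs can be sketched with a toy in-order issue model. It is a simplification under stated assumptions, not the A8's scoreboard: only read-after-write dependences are tracked, each result is assumed to be usable `latency` cycles after issue (forwarding is folded into that number), and the `Instr` type and `issue_cycles` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    dest: Optional[str]      # destination register, or None
    srcs: Tuple[str, ...]    # source registers
    latency: int = 1         # assumption: cycles until a dependent instruction may issue


def issue_cycles(instrs, width=2):
    """Return the issue cycle of each instruction for in-order, width-wide issue."""
    ready = {}   # register -> first cycle in which its value is available
    cycles = []
    for ins in instrs:
        # In-order issue: never earlier than the previous instruction.
        c = cycles[-1] if cycles else 0
        # Wait until every source operand is available.
        for reg in ins.srcs:
            c = max(c, ready.get(reg, 0))
        # At most `width` instructions may issue in one cycle.
        while cycles.count(c) >= width:
            c += 1
        cycles.append(c)
        if ins.dest is not None:
            ready[ins.dest] = c + ins.latency
    return cycles


program = [
    Instr("r1", ("r2", "r3")),
    Instr("r4", ("r1",)),      # depends on the first instruction
    Instr("r5", ("r6",)),
    Instr("r7", ("r8",)),
]
print(issue_cycles(program))   # prints [0, 1, 1, 2]
```

In this run the second instruction cannot pair with the first because it reads `r1`, so it slips to cycle 1 and pairs with the third instead. An independent pair would issue together in cycle 0.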

FIGURE 3.37 The five-stage instruction decode of the A8. In the first stage, a PC produced by the fetch unit (either from the branch target buffer or the PC incrementer) is used to retrieve an 8-byte block from the cache. Up to two instructions are decoded and placed into the decode queue; if neither instruction is a branch, the PC is incremented for the next fetch. Once in the decode queue, the scoreboard logic decides when the instructions can issue. In the issue stage, the register operands are read; recall that in a simple scoreboard, the operands always come from the registers. The register operands and opcode are sent to the instruction execution portion of the pipeline.
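
The fetch step in the caption can be sketched as follows. This is only an illustration under assumptions the caption does not state: fixed 4-byte instructions, 8-byte-aligned fetch blocks, little-endian encoding, and a flat `bytes` object standing in for the instruction cache. The function names are hypothetical.

```python
FETCH_BYTES = 8   # block size from the caption
INSTR_BYTES = 4   # assumption: fixed 4-byte instructions


def fetch_block(memory: bytes, pc: int):
    """Return (address, word) pairs for the instructions at or after pc in its block."""
    base = pc - (pc % FETCH_BYTES)          # assumption: blocks are 8-byte aligned
    block = memory[base:base + FETCH_BYTES]
    fetched = []
    for off in range(pc - base, len(block) - INSTR_BYTES + 1, INSTR_BYTES):
        word = int.from_bytes(block[off:off + INSTR_BYTES], "little")
        fetched.append((base + off, word))
    return fetched


def next_fetch_pc(pc: int, predicted_target=None):
    """Choose the next fetch address: a predicted branch target, else the next block."""
    if predicted_target is not None:
        return predicted_target             # supplied by the branch target buffer
    return pc - (pc % FETCH_BYTES) + FETCH_BYTES   # PC incrementer
```

A fetch that starts in the middle of a block yields only one instruction in this model, which is one reason fewer than two instructions may reach the decode queue in a cycle.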

Figure 3.38 shows the execution pipeline for the A8 processor. Either instruction 1 or instruction 2 can go to the load/store pipeline. Full bypassing is supported among the pipelines.

The ARM Cortex-A8 pipeline uses a simple two-issue statically scheduled superscalar to allow a reasonably high clock rate with lower power. In contrast, the i7 uses a reasonably aggressive, four-issue dynamically scheduled speculative pipeline structure.
