Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach

Hardware Reference

In-Depth Information

FIGURE 3.41 The Intel Core i7 pipeline structure shown with the memory system com-

ponents . The total pipeline depth is 14 stages, with branch mispredictions costing 17 cycles.

There are 48 load and 32 store buffers. The six independent functional units can each begin

execution of a ready micro-op in the same cycle.

1. Instruction fetch—The processor uses a multilevel branch target buffer to achieve a balance

between speed and prediction accuracy. There is also a return address stack to speed up

function return. Mispredictions cause a penalty of about 15 cycles. Using the predicted ad-

dress, the instruction fetch unit fetches 16 bytes from the instruction cache.

2. The 16 bytes are placed in the predecode instruction buffer—In this step, a process called

macro-op fusion is executed. Macro-op fusion takes instruction combinations such as com-

pare followed by a branch and fuses them into a single operation. The predecode stage

also breaks the 16 bytes into individual x86 instructions. This predecode is nontrivial since

the length of an x86 instruction can be from 1 to 17 bytes and the predecoder must look

through a number of bytes before it knows the instruction length. Individual x86 instruc-

tions (including some fused instructions) are placed into the 18-entry instruction queue.

3. Micro-op decode—Individual x86 instructions are translated into micro-ops. Micro-ops are

simple MIPS-like instructions that can be executed directly by the pipeline; this approach

of translating the x86 instruction set into simple operations that are more easily pipelined

was introduced in the Pentium Pro in 1997 and has been used since. Three of the decoders

Search WWH ::

Custom Search

Home