Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach

handle x86 instructions that translate directly into one micro-op. For x86 instructions that

have more complex semantics, there is a microcode engine that is used to produce the

micro-op sequence; it can produce up to four micro-ops every cycle and continues until the

necessary micro-op sequence has been generated. The micro-ops are placed according to

the order of the x86 instructions in the 28-entry micro-op buffer.

4. The micro-op buffer performs loop stream detection and microfusion. If there is a small

sequence of instructions (fewer than 28 instructions or 256 bytes in length) that comprises a

loop, the loop stream detector will find the loop and directly issue the micro-ops from the

buffer, eliminating the need for the instruction fetch and instruction decode stages to be

activated. Microfusion combines instruction pairs such as load/ALU operation and ALU

operation/store and issues them to a single reservation station (where they can still issue

independently), thus increasing the usage of the buffer. In a study of the Intel Core architecture,

which also incorporated microfusion and macrofusion, Bird et al. [2007] discovered that

microfusion had little impact on performance, while macrofusion appears to have a modest

positive impact on integer performance and little impact on floating-point performance.

5. Perform the basic instruction issue: looking up the register location in the register tables,

renaming the registers, allocating a reorder buffer entry, and fetching any results from the

registers or reorder buffer before sending the micro-ops to the reservation stations.

6. The i7 uses a 36-entry centralized reservation station shared by six functional units. Up to

six micro-ops may be dispatched to the functional units every clock cycle.

7. Micro-ops are executed by the individual functional units and then results are sent back to

any waiting reservation station as well as to the register retirement unit, where they will

update the register state, once it is known that the instruction is no longer speculative. The

entry corresponding to the instruction in the reorder buffer is marked as complete.

8. When one or more instructions at the head of the reorder buffer have been marked as

complete, the pending writes in the register retirement unit are executed, and the instructions

are removed from the reorder buffer.
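
The interaction between steps 7 and 8 (results complete out of order, but instructions leave the reorder buffer only from the head) can be illustrated with a small simulation. This is a minimal sketch, not a model of the actual i7 hardware: the class name `ReorderBuffer`, its methods, and the micro-op tags are all hypothetical, and details such as speculation recovery and the separate register retirement unit are deliberately left out.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: entries complete in any order but retire in program order."""

    def __init__(self):
        self.entries = deque()    # oldest micro-op sits at the left end (the head)
        self.register_state = {}  # architectural registers, written only at retirement

    def allocate(self, tag, dest_reg):
        # Issue (step 5): every micro-op receives an entry in program order.
        self.entries.append(
            {"tag": tag, "dest": dest_reg, "value": None, "complete": False}
        )

    def mark_complete(self, tag, value):
        # Execution (step 7): a functional unit delivers a result and the
        # matching entry is marked complete; register state is not touched yet.
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["value"] = value
                entry["complete"] = True
                return
        raise KeyError(tag)

    def retire(self):
        # Retirement (step 8): only completed entries at the head may leave,
        # so a finished micro-op waits behind an older unfinished one.
        retired = []
        while self.entries and self.entries[0]["complete"]:
            entry = self.entries.popleft()
            self.register_state[entry["dest"]] = entry["value"]
            retired.append(entry["tag"])
        return retired


rob = ReorderBuffer()
rob.allocate("uop0", "r1")
rob.allocate("uop1", "r2")
rob.allocate("uop2", "r3")

rob.mark_complete("uop1", 20)  # younger micro-op finishes first
first = rob.retire()           # nothing retires: uop0 at the head is not complete

rob.mark_complete("uop0", 10)
second = rob.retire()          # uop0 and uop1 now retire together, in order
```

In this run the first call to `retire()` returns an empty list even though `uop1` has finished, which is the behavior that keeps the register state precise while execution proceeds out of order.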

Performance of the i7

In earlier sections, we examined the performance of the i7's branch predictor and also the

performance of SMT. In this section, we look at single-thread pipeline performance. Because of

the presence of aggressive speculation as well as nonblocking caches, it is difficult to attribute

the gap between idealized performance and actual performance accurately. As we will see,

relatively few stalls occur because instructions cannot issue. For example, only about 3% of

the loads are delayed because no reservation station is available. Most losses come either from

branch mispredicts or cache misses. The cost of a branch mispredict is 15 cycles, while the cost

of an L1 miss is about 10 cycles; L2 misses are slightly more than three times as costly as an

L1 miss, and L3 misses cost about 13 times what an L1 miss costs (130-135 cycles)! Although

the processor will attempt to find alternative instructions to execute during L3 misses and some

L2 misses, it is likely that some of the buffers will fill before the miss completes, causing the

processor to stop issuing instructions.
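
To get a feel for how these penalties compare, the following sketch multiplies per-event costs by event counts. The penalties for a branch mispredict, an L1 miss, and an L3 miss are taken from the paragraph above; the L2 penalty of 30 cycles is an assumption that rounds "slightly more than three times" down to exactly three times. The event counts are purely hypothetical and do not come from any measurement. Because the processor overlaps misses with other work, the result is an upper bound on lost cycles, not a prediction.

```python
# Per-event penalties in cycles. Values other than l2_miss come from the text;
# l2_miss is an assumed round number (3 x the L1 miss cost).
PENALTY_CYCLES = {
    "branch_mispredict": 15,
    "l1_miss": 10,
    "l2_miss": 30,
    "l3_miss": 130,  # low end of the 130-135 cycle range
}


def worst_case_stall_cycles(event_counts):
    """Return penalty cycles per event type, assuming no overlap between events."""
    return {event: count * PENALTY_CYCLES[event] for event, count in event_counts.items()}


# Hypothetical counts per 1000 instructions. Each memory access is counted once,
# under the deepest cache level it missed in, to avoid double counting.
events_per_1000_instructions = {
    "branch_mispredict": 5,
    "l1_miss": 20,
    "l2_miss": 4,
    "l3_miss": 1,
}

stalls = worst_case_stall_cycles(events_per_1000_instructions)
total = sum(stalls.values())

for event, cycles in stalls.items():
    print(f"{event:18s} {cycles:4d} cycles ({cycles / total:.0%} of penalty cycles)")
```

With these made-up counts, a single L3 miss accounts for more penalty cycles than five branch mispredicts, which shows why even rare L3 misses matter.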

To examine the cost of mispredicts and incorrect speculation, Figure 3.42 shows the fraction

of the work (measured by the number of micro-ops dispatched into the pipeline) that does not

retire (i.e., whose results are annulled), relative to all micro-op dispatches. For sjeng, for

example, 25% of the work is wasted, since 25% of the dispatched micro-ops are never retired.
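
The wasted-work metric described here is the share of dispatched micro-ops that never retire. A minimal sketch of the calculation follows; the function name and the micro-op counts are hypothetical, chosen only so that the result matches the 25% quoted for sjeng.

```python
def wasted_work_fraction(dispatched, retired):
    """Fraction of dispatched micro-ops that never retire."""
    if dispatched <= 0:
        raise ValueError("dispatched must be positive")
    if not 0 <= retired <= dispatched:
        raise ValueError("retired must lie between 0 and dispatched")
    return (dispatched - retired) / dispatched


# Hypothetical counts: 1,000,000 micro-ops dispatched, 750,000 of them retired.
sjeng_like = wasted_work_fraction(dispatched=1_000_000, retired=750_000)
print(f"{sjeng_like:.0%} of dispatched micro-ops were annulled")
```

On a real machine the two counts would come from hardware performance counters, whose names vary by processor generation.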
