handle x86 instructions that translate directly into one micro-op. For x86 instructions that
have more complex semantics, there is a microcode engine that is used to produce the
micro-op sequence; it can produce up to four micro-ops every cycle and continues until the
necessary micro-op sequence has been generated. The micro-ops are placed according to
the order of the x86 instructions in the 28-entry micro-op buffer.
4. The micro-op buffer performs loop stream detection and microfusion: If there is a small
sequence of instructions (fewer than 28 instructions or 256 bytes in length) that comprises a
loop, the loop stream detector will find the loop and directly issue the micro-ops from the
buffer, eliminating the need for the instruction fetch and instruction decode stages to be
activated. Microfusion combines instruction pairs such as load/ALU operation and ALU
operation/store and issues them to a single reservation station (where they can still issue
independently), thus increasing the usage of the buffer. In a study of the Intel Core architecture,
which also incorporated microfusion and macrofusion, Bird et al. [2007] discovered that
microfusion had little impact on performance, while macrofusion appeared to have a modest
positive impact on integer performance and little impact on floating-point performance.
5. Perform the basic instruction issue: look up the register location in the register tables,
rename the registers, allocate a reorder buffer entry, and fetch any results from the
registers or reorder buffer before sending the micro-ops to the reservation stations.
6. The i7 uses a 36-entry centralized reservation station shared by six functional units. Up to
six micro-ops may be dispatched to the functional units every clock cycle.
7. Micro-ops are executed by the individual functional units, and then results are sent back to
any waiting reservation station as well as to the register retirement unit, where they will
update the register state, once it is known that the instruction is no longer speculative. The
entry corresponding to the instruction in the reorder buffer is marked as complete.
8. When one or more instructions at the head of the reorder buffer have been marked as
complete, the pending writes in the register retirement unit are executed, and the instructions
are removed from the reorder buffer.
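The completion and retirement behavior in steps 7 and 8 can be illustrated with a small software model. This is only a sketch under simplifying assumptions, not a description of the i7's actual hardware: the class and field names (`RobEntry`, `ReorderBuffer`, `tag`, `arch_regs`) are hypothetical, the register retirement unit is folded into a single dictionary, and speculation recovery is not modeled. What it shows is that results may complete out of order, while register state is updated strictly from the head of the buffer.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class RobEntry:
    tag: int                      # program-order sequence number (hypothetical field)
    dest: str                     # architectural destination register
    value: Optional[int] = None   # result, filled in when execution finishes
    complete: bool = False        # set when the functional unit reports the result


class ReorderBuffer:
    """Toy model: out-of-order completion, in-order retirement."""

    def __init__(self):
        self.entries = deque()    # entries[0] is the head (oldest instruction)
        self.arch_regs = {}       # committed (non-speculative) register state

    def allocate(self, tag, dest):
        # Step 5 allocates an entry at issue time, in program order.
        self.entries.append(RobEntry(tag, dest))

    def mark_complete(self, tag, value):
        # Step 7: a functional unit finished; record the result but do not commit it.
        for entry in self.entries:
            if entry.tag == tag:
                entry.value = value
                entry.complete = True
                return
        raise KeyError(f"no in-flight entry with tag {tag}")

    def retire(self):
        # Step 8: commit only the completed entries at the head of the buffer.
        retired = []
        while self.entries and self.entries[0].complete:
            entry = self.entries.popleft()
            self.arch_regs[entry.dest] = entry.value
            retired.append(entry.tag)
        return retired


rob = ReorderBuffer()
for tag, dest in enumerate(["rax", "rbx", "rcx"]):
    rob.allocate(tag, dest)

rob.mark_complete(1, 7)   # the second instruction finishes first
print(rob.retire())       # [] because the head (tag 0) is not complete yet
rob.mark_complete(0, 5)
print(rob.retire())       # [0, 1] retire together, in program order
```

In this model the younger instruction's result waits in the buffer until the older one completes, which is the property that lets the processor discard speculative work without corrupting register state.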
Performance of the i7
In earlier sections, we examined the performance of the i7's branch predictor and also the
performance of SMT. In this section, we look at single-thread pipeline performance. Because of
the presence of aggressive speculation as well as nonblocking caches, it is difficult to attribute
the gap between idealized performance and actual performance accurately. As we will see,
relatively few stalls occur because instructions cannot issue. For example, only about 3% of
the loads are delayed because no reservation station is available. Most losses come either from
branch mispredicts or cache misses. The cost of a branch mispredict is 15 cycles, while the cost
of an L1 miss is about 10 cycles; L2 misses are slightly more than three times as costly as an
L1 miss, and L3 misses cost about 13 times what an L1 miss costs (130-135 cycles)! Although
the processor will attempt to find alternative instructions to execute for L3 misses and some
L2 misses, it is likely that some of the buffers will fill before the miss completes, causing the
processor to stop issuing instructions.
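To get a feel for how these penalties add up, the following sketch computes a rough upper bound on the stall cycles per instruction from per-event counts. The branch mispredict, L1 miss, and L3 miss penalties come from the paragraph above; the L2 figure of 32 cycles is an assumed stand-in for "slightly more than three times" an L1 miss, and the event counts in the example are made up for illustration. The function name `upper_bound_stall_cpi` is hypothetical. The result is an upper bound because out-of-order execution and nonblocking caches overlap part of each penalty with useful work.

```python
# Penalties in cycles. Only the L2 value is an assumption (the text says
# "slightly more than three times" the L1 miss cost).
PENALTY_CYCLES = {
    "branch_mispredict": 15,
    "l1_miss": 10,
    "l2_miss": 32,    # assumed value
    "l3_miss": 130,   # low end of the 130-135 cycle range
}


def upper_bound_stall_cpi(events_per_1000_instr, penalties=PENALTY_CYCLES):
    """Return stall cycles per instruction if no penalty were overlapped.

    Each event is assumed to be counted once, at the deepest level it missed.
    """
    total_cycles = 0.0
    for event, count in events_per_1000_instr.items():
        total_cycles += count * penalties[event]
    return total_cycles / 1000.0


# Hypothetical event counts per 1000 instructions, not measured data.
example = {"branch_mispredict": 5, "l1_miss": 20, "l2_miss": 2, "l3_miss": 1}
print(upper_bound_stall_cpi(example))  # (75 + 200 + 64 + 130) / 1000 = 0.469
```

Even with only one L3 miss per 1000 instructions in this made-up mix, that single event contributes more stall cycles than the five branch mispredicts combined, which illustrates why the deepest misses dominate.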
To examine the cost of mispredicts and incorrect speculation, Figure 3.42 shows the fraction
of the work (measured by the number of micro-ops dispatched into the pipeline) that does not
retire (i.e., whose results are annulled), relative to all micro-op dispatches. For sjeng, for
example, 25% of the work is wasted, since 25% of the dispatched micro-ops are never retired.
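The wasted-work metric described above is simply the share of dispatched micro-ops that never retire. A minimal sketch of the calculation follows; the function name `wasted_work_fraction` and the counter values are hypothetical, chosen only so that the result matches the 25% quoted for sjeng.

```python
def wasted_work_fraction(dispatched_uops, retired_uops):
    """Fraction of dispatched micro-ops whose results are annulled."""
    if dispatched_uops <= 0:
        raise ValueError("dispatched_uops must be positive")
    if not 0 <= retired_uops <= dispatched_uops:
        raise ValueError("retired_uops must lie between 0 and dispatched_uops")
    return (dispatched_uops - retired_uops) / dispatched_uops


# Hypothetical counts: 1,000,000 micro-ops dispatched, 750,000 retired.
print(wasted_work_fraction(1_000_000, 750_000))  # 0.25
```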