Hardware Reference
In-Depth Information
limiting the number of instructions of a given class (say, one FP, one integer, one load, one
store), the necessary reservation stations can be preallocated. Should sufficient reservation
stations not be available (such as when the next few instructions in the program are all of
one instruction type), the bundle is broken, and only a subset of the instructions, in the
original program order, is issued. The remainder of the instructions in the bundle can be
placed in the next bundle for potential issue.
2. Analyze all the dependences among the instructions in the issue bundle.
3. If an instruction in the bundle depends on an earlier instruction in the bundle, use the
assigned reorder buffer number to update the reservation table for the dependent instruction.
Otherwise, use the existing reservation table and reorder buffer information to update
the reservation table entries for the issuing instruction.
Of course, what makes the above very complicated is that it is all done in parallel in a single
clock cycle!
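The issue steps above can be sketched in software. The following Python model is only an illustration under stated assumptions: the names `Instr`, `form_bundle`, and `rename_bundle` are hypothetical, the reservation table is reduced to a register-to-tag map, and the work is done sequentially, whereas real hardware performs all of it in parallel within one clock.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Instr:
    op_class: str            # "fp", "int", "load", or "store" (assumed classes)
    dest: Optional[str]      # destination register, or None (e.g., a store)
    srcs: Tuple[str, ...]    # source registers


def form_bundle(window: List[Instr], free_rs: Dict[str, int], width: int):
    """Pick instructions in program order until the issue width is reached
    or some instruction's class has no free reservation station left."""
    remaining = dict(free_rs)
    bundle: List[Instr] = []
    for ins in window[:width]:
        if remaining.get(ins.op_class, 0) == 0:
            break            # bundle is broken; the rest wait for the next bundle
        remaining[ins.op_class] -= 1
        bundle.append(ins)
    return bundle, window[len(bundle):]


def rename_bundle(bundle: List[Instr], rename_map: Dict[str, str], next_rob: int):
    """For each source operand, use the reorder buffer tag of the latest
    pending producer (possibly an earlier instruction in the same bundle),
    or the plain register name when no producer is pending."""
    local = dict(rename_map)
    issued = []
    for i, ins in enumerate(bundle):
        rob = next_rob + i                       # preassigned ROB entry
        tags = tuple(local.get(r, r) for r in ins.srcs)
        issued.append((rob, tags))
        if ins.dest is not None:
            local[ins.dest] = f"ROB{rob}"        # later readers see this tag
    return issued, local


# Hypothetical window: LD R2,0(R1); DADDIU R2,R2,#1; SD R2,0(R1); DADDIU R1,R1,#8
window = [
    Instr("load", "R2", ("R1",)),
    Instr("int", "R2", ("R2",)),
    Instr("store", None, ("R2", "R1")),
    Instr("int", "R1", ("R1",)),
]
free = {"fp": 1, "int": 1, "load": 1, "store": 1}
bundle, leftover = form_bundle(window, free, width=4)
issued, new_map = rename_bundle(bundle, {}, next_rob=0)
```

With one reservation station per class, the second integer instruction cannot join the bundle, so the bundle is broken after the store and the last instruction is left for the next bundle. Within the bundle, the add reads the load's reorder buffer tag and the store reads the add's tag, which is the intra-bundle dependence case described in step 3.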
At the back-end of the pipeline, we must be able to complete and commit multiple instructions
per clock. These steps are somewhat easier than the issue problems since multiple instructions
that can actually commit in the same clock cycle must have already dealt with and
resolved any dependences. As we will see, designers have figured out how to handle this complexity:
The Intel i7, which we examine in Section 3.13, uses essentially the scheme we have
described for speculative multiple issue, including a large number of reservation stations, a
reorder buffer, and a load and store buffer that is also used to handle nonblocking cache misses.
From a performance viewpoint, we can show how the concepts fit together with an example.
Example
Consider the execution of the following loop, which increments each element of
an integer array, on a two-issue processor, once without speculation and once
with speculation:
Loop: LD     R2,0(R1)    ;R2=array element
      DADDIU R2,R2,#1    ;increment R2
      SD     R2,0(R1)    ;store result
      DADDIU R1,R1,#8    ;increment pointer
      BNE    R2,R3,LOOP  ;branch if not last element
Assume that there are separate integer functional units for effective address
calculation, for ALU operations, and for branch condition evaluation. Create a
table for the first three iterations of this loop for both processors. Assume that
up to two instructions of any type can commit per clock.
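As a starting point for such a table, the issue clock of each instruction can be estimated with a small Python model. This is a sketch under assumptions that are not stated in the example itself: instructions issue strictly in program order, at most two per clock, and a branch ends its issue bundle so that the next iteration starts in the following clock. The names `LOOP_BODY` and `issue_clocks` are hypothetical, and the model covers issue only, not execution, memory access, or commit.

```python
LOOP_BODY = ["LD", "DADDIU", "SD", "DADDIU", "BNE"]


def issue_clocks(body, iterations, width=2):
    """Return (iteration, opcode, issue clock) tuples for an in-order,
    width-wide issue stage. Assumption: a branch is always the last
    instruction issued in its clock."""
    rows = []
    clock = 1
    slots_used = 0
    for it in range(1, iterations + 1):
        for op in body:
            if slots_used == width:      # bundle full: move to the next clock
                clock += 1
                slots_used = 0
            rows.append((it, op, clock))
            slots_used += 1
            if op == "BNE":              # assumed: branch closes the bundle
                clock += 1
                slots_used = 0
    return rows


for it, op, clock in issue_clocks(LOOP_BODY, 3):
    print(f"iteration {it}: {op:<7} issues in clock {clock}")
```

Under these assumptions each iteration takes three issue clocks, so the three branches issue in clocks 3, 6, and 9. The clocks at which instructions execute and commit depend on the dependences and on whether speculation is allowed, which is what the two tables are meant to contrast.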
Answer
Figures 3.19 and 3.20 show the performance for a two-issue dynamically scheduled
processor, without and with speculation. In this case, where a branch
can be a critical performance limiter, speculation helps significantly. The third
branch in the speculative processor executes in clock cycle 13, while it executes
in clock cycle 19 on the nonspeculative pipeline. Because the completion rate on
the nonspeculative pipeline is falling behind the issue rate rapidly, the nonspeculative
pipeline will stall when a few more iterations are issued. The performance
of the nonspeculative processor could be improved by allowing load instructions
to complete effective address calculation before a branch is decided,