Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach

limiting the number of instructions of a given class (say, one FP, one integer, one load, one

store), the necessary reservation stations can be preallocated. Should sufficient reservation

stations not be available (such as when the next few instructions in the program are all of

one instruction type), the bundle is broken, and only a subset of the instructions, in the

original program order, is issued. The remainder of the instructions in the bundle can be

placed in the next bundle for potential issue.

2. Analyze all the dependences among the instructions in the issue bundle.

3. If an instruction in the bundle depends on an earlier instruction in the bundle, use the assigned reorder buffer number to update the reservation table for the dependent instruction. Otherwise, use the existing reservation table and reorder buffer information to update the reservation table entries for the issuing instruction.
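The bundle-breaking rule in the first step (issue only an in-order subset when reservation stations run short, and defer the rest to the next bundle) can be illustrated with a minimal Python sketch. The function name `split_bundle`, the instruction-class labels, and the dictionary of free stations are assumptions made for illustration, not part of the text, and the check is written as a sequential loop purely for readability.

```python
from collections import Counter

def split_bundle(bundle, free_stations):
    """Issue the longest in-order prefix of `bundle` that fits the free stations.

    bundle: list of (name, instr_class) tuples in program order (hypothetical format).
    free_stations: dict mapping instr_class -> number of free reservation stations.
    Returns (issued, deferred); `deferred` would be offered again in the next bundle.
    """
    remaining = Counter(free_stations)  # classes not listed count as zero free stations
    issued = []
    for index, (name, instr_class) in enumerate(bundle):
        if remaining[instr_class] == 0:
            # Break the bundle here: nothing after this point may issue,
            # because issue must follow the original program order.
            return issued, bundle[index:]
        remaining[instr_class] -= 1
        issued.append((name, instr_class))
    return issued, []

# Two integer instructions compete for a single integer station.
bundle = [("i1", "int"), ("i2", "int"), ("i3", "load")]
issued, deferred = split_bundle(bundle, {"fp": 1, "int": 1, "load": 1, "store": 1})
```

In this usage example `i3` is deferred even though a load station is free, since issuing it ahead of `i2` would violate program order.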

Of course, what makes the above very complicated is that it is all done in parallel in a single

clock cycle!
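Steps 2 and 3 can be sketched in Python as follows. This is a sequential model under stated assumptions: the function name `rename_bundle`, the tuple format for instructions, and the `register_status` dictionary (register name to the reorder buffer number of its pending producer) are hypothetical, and reorder buffer wraparound is ignored. Hardware would evaluate all of these comparisons in parallel within one clock cycle.

```python
def rename_bundle(bundle, register_status, next_rob):
    """Tag each source operand with the ROB number it must wait for, if any.

    bundle: list of (dest, sources) in program order; dest may be None.
    register_status: dict reg -> ROB number of the in-flight producer
                     (a missing key means the value is already available).
    next_rob: first of the ROB numbers preallocated to this bundle.
    Returns (entries, updated_status).
    """
    status = dict(register_status)  # work on a copy of the existing information
    entries = []
    for offset, (dest, sources) in enumerate(bundle):
        rob = next_rob + offset
        # If an earlier instruction in this bundle wrote the register, `status`
        # already holds that instruction's assigned ROB number; otherwise it
        # holds the pre-existing information (or nothing, meaning "ready").
        tags = {reg: status.get(reg) for reg in sources}
        entries.append({"rob": rob, "dest": dest, "source_tags": tags})
        if dest is not None:
            status[dest] = rob
    return entries, status

# A load followed by a dependent add, issued together as ROB entries 5 and 6.
entries, status = rename_bundle([("R2", ["R1"]), ("R2", ["R2"])], {}, next_rob=5)
```

Here the second instruction's `R2` operand is tagged with ROB number 5, the entry assigned to the first instruction in the same bundle.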

At the back-end of the pipeline, we must be able to complete and commit multiple instructions per clock. These steps are somewhat easier than the issue problems since multiple instructions that can actually commit in the same clock cycle must have already dealt with and resolved any dependences. As we will see, designers have figured out how to handle this complexity: The Intel i7, which we examine in Section 3.13, uses essentially the scheme we have described for speculative multiple issue, including a large number of reservation stations, a reorder buffer, and a load and store buffer that is also used to handle nonblocking cache misses.
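Committing several instructions per clock can be pictured with a small Python sketch. It assumes a reorder buffer modeled as a `deque` of dictionaries with hypothetical `name` and `done` fields, and a commit width of two; none of these names come from the text, and the sketch omits exceptions and branch mispredictions.

```python
from collections import deque

def commit_cycle(rob, width=2):
    """Retire up to `width` finished instructions from the head of the ROB, in order."""
    committed = []
    # Commit stops at the first unfinished instruction, so later finished
    # instructions wait behind it and program order is preserved.
    while rob and len(committed) < width and rob[0]["done"]:
        committed.append(rob.popleft()["name"])
    return committed

rob = deque([
    {"name": "LD", "done": True},
    {"name": "DADDIU", "done": True},
    {"name": "SD", "done": False},
    {"name": "BNE", "done": True},
])
first = commit_cycle(rob)   # retires LD and DADDIU
second = commit_cycle(rob)  # retires nothing: SD at the head is not finished
```

No dependence check appears in the commit loop, which reflects the point made above: any dependences were resolved before the instructions reached this stage.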

From a performance viewpoint, we can show how the concepts fit together with an example.

Example

Consider the execution of the following loop, which increments each element of

an integer array, on a two-issue processor, once without speculation and once

with speculation:

Loop: LD     R2,0(R1)    ;R2=array element
      DADDIU R2,R2,#1    ;increment R2
      SD     R2,0(R1)    ;store result
      DADDIU R1,R1,#8    ;increment pointer
      BNE    R2,R3,Loop  ;branch if not last element

Assume that there are separate integer functional units for effective address

calculation, for ALU operations, and for branch condition evaluation. Create a

table for the first three iterations of this loop for both processors. Assume that

up to two instructions of any type can commit per clock.
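Before building the table, it may help to list the true (read-after-write) dependences within one iteration of the loop. The Python sketch below does this under a simplifying assumption: each instruction is encoded by hand as a hypothetical (opcode, destination, sources) tuple, and only register dependences inside a single iteration are considered.

```python
# One iteration of the loop, encoded as (opcode, destination, source registers).
LOOP_BODY = [
    ("LD",     "R2", ["R1"]),
    ("DADDIU", "R2", ["R2"]),
    ("SD",     None, ["R2", "R1"]),
    ("DADDIU", "R1", ["R1"]),
    ("BNE",    None, ["R2", "R3"]),
]

def raw_dependences(body):
    """Return (producer_index, consumer_index, register) for each RAW dependence."""
    last_writer = {}  # register -> index of the most recent instruction writing it
    deps = []
    for index, (_opcode, dest, sources) in enumerate(body):
        for reg in sources:
            if reg in last_writer:
                deps.append((last_writer[reg], index, reg))
        if dest is not None:
            last_writer[dest] = index
    return deps

deps = raw_dependences(LOOP_BODY)
# deps == [(0, 1, "R2"), (1, 2, "R2"), (1, 4, "R2")]
```

The result shows a chain from the load through the first DADDIU to the branch, so the branch cannot be evaluated until the loaded value has been incremented.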

Answer

Figures 3.19 and 3.20 show the performance for a two-issue dynamically scheduled processor, without and with speculation. In this case, where a branch can be a critical performance limiter, speculation helps significantly. The third branch in the speculative processor executes in clock cycle 13, while it executes in clock cycle 19 on the nonspeculative pipeline. Because the completion rate on the nonspeculative pipeline is falling behind the issue rate rapidly, the nonspeculative pipeline will stall when a few more iterations are issued. The performance of the nonspeculative processor could be improved by allowing load instructions to complete effective address calculation before a branch is decided,
