Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach

Hardware Reference

In-Depth Information

Answer

Here is the result after merging the DADDUI instructions and dropping the unne-

cessary BNE operations that are duplicated during unrolling. Note that R2 must

now be set so that 32(R2) is the starting address of the last four elements.

Loop: L.D

F0,0(R1)

ADD.D

F4,F0,F2

S.D

F4,0(R1)

;drop DADDUI & BNE

L.D

F6,−8(R1)

ADD.D

F8,F6,F2

S.D

F8,−8(R1)

;drop DADDUI & BNE

L.D

F10,−16(R1)

ADD.D

F12,F10,F2

S.D

F12,−16(R1)

;drop DADDUI & BNE

L.D F14,−24(R1)

ADD.D F16,F14,F2

S.D F16,−24(R1)

DADDUI R1,R1,#−32

BNE R1,R2,Loop

We have eliminated three branches and three decrements of R1 . The addresses

on the loads and stores have been compensated to allow the DADDUI instructions

on R1 to be merged. This optimization may seem trivial, but it is not; it requires

symbolic substitution and simplification. Symbolic substitution and simpliica-

tion will rearrange expressions so as to allow constants to be collapsed, allow-

ing an expression such as (( i + 1) + 1) to be rewriten as ( i + (1 + 1)) and then

simpliied to ( i + 2). We will see more general forms of these optimizations that

eliminate dependent computations in Appendix H.

Without scheduling, every operation in the unrolled loop is followed by a

dependent operation and thus will cause a stall. This loop will run in 27 clock

cycles—each LD has 1 stall, each ADDD 2, the DADDUI 1, plus 14 instruction issue

cycles—or 6.75 clock cycles for each of the four elements, but it can be scheduled

to improve performance significantly. Loop unrolling is normally done early in

the compilation process, so that redundant computations can be exposed and

eliminated by the optimizer.

In real programs we do not usually know the upper bound on the loop. Suppose it is n , and

we would like to unroll the loop to make k copies of the body. Instead of a single unrolled

loop, we generate a pair of consecutive loops. The first executes ( n mod k ) times and has a

body that is the original loop. The second is the unrolled body surrounded by an outer loop

that iterates ( n/k ) times. (As we shall see in Chapter 4 , this technique is similar to a technique

called strip mining , used in compilers for vector processors.) For large values of n , most of the

execution time will be spent in the unrolled loop body.

In the previous example, unrolling improves the performance of this loop by eliminating

overhead instructions, although it increases code size substantially. How will the unrolled

loop perform when it is scheduled for the pipeline described earlier?

Computer Architecture: A Quantitative Approach

Search WWH ::

Custom Search

Home