Hardware Reference
In-Depth Information
Answer
Here is the result after merging the
DADDUI
instructions and dropping the unne-
cessary
BNE
operations that are duplicated during unrolling. Note that
R2
must
now be set so that
32(R2)
is the starting address of the last four elements.
Loop: L.D
F0,0(R1)
ADD.D
F4,F0,F2
S.D
F4,0(R1)
;drop DADDUI & BNE
L.D
F6,−8(R1)
ADD.D
F8,F6,F2
S.D
F8,−8(R1)
;drop DADDUI & BNE
L.D
F10,−16(R1)
ADD.D
F12,F10,F2
S.D
F12,−16(R1)
;drop DADDUI & BNE
L.D F14,−24(R1)
ADD.D F16,F14,F2
S.D F16,−24(R1)
DADDUI R1,R1,#−32
BNE R1,R2,Loop
We have eliminated three branches and three decrements of
R1
. The addresses
on the loads and stores have been compensated to allow the
DADDUI
instructions
on
R1
to be merged. This optimization may seem trivial, but it is not; it requires
symbolic substitution and simplification. Symbolic substitution and simpliica-
tion will rearrange expressions so as to allow constants to be collapsed, allow-
ing an expression such as ((
i
+ 1) + 1) to be rewriten as (
i
+ (1 + 1)) and then
simpliied to (
i
+ 2). We will see more general forms of these optimizations that
eliminate dependent computations in Appendix H.
Without scheduling, every operation in the unrolled loop is followed by a
dependent operation and thus will cause a stall. This loop will run in 27 clock
cycles—each
LD
has 1 stall, each
ADDD
2, the
DADDUI
1, plus 14 instruction issue
cycles—or 6.75 clock cycles for each of the four elements, but it can be scheduled
to improve performance significantly. Loop unrolling is normally done early in
the compilation process, so that redundant computations can be exposed and
eliminated by the optimizer.
In real programs we do not usually know the upper bound on the loop. Suppose it is
n
, and
we would like to unroll the loop to make
k
copies of the body. Instead of a single unrolled
loop, we generate a pair of consecutive loops. The first executes (
n
mod
k
) times and has a
body that is the original loop. The second is the unrolled body surrounded by an outer loop
that iterates (
n/k
) times. (As we shall see in
Chapter 4
,
this technique is similar to a technique
called
strip mining
, used in compilers for vector processors.) For large values of
n
, most of the
execution time will be spent in the unrolled loop body.
In the previous example, unrolling improves the performance of this loop by eliminating
overhead instructions, although it increases code size substantially. How will the unrolled
loop perform when it is scheduled for the pipeline described earlier?
Search WWH ::
Custom Search