quires analyzing the memory addresses and finding that they do not refer to the same address.
■ Schedule the code, preserving any dependences needed to yield the same result as the original code.
The key requirement underlying all of these transformations is an understanding of how one
instruction depends on another and how the instructions can be changed or reordered given
the dependences.
Three different effects limit the gains from loop unrolling: (1) a decrease in the amount of
overhead amortized with each unroll, (2) code size limitations, and (3) compiler limitations.
Let's consider the question of loop overhead first. When we unrolled the loop four times, it generated sufficient parallelism among the instructions that the loop could be scheduled with no stall cycles. In fact, in 14 clock cycles, only 2 cycles were loop overhead: the DADDUI, which maintains the index value, and the BNE, which terminates the loop. If the loop is unrolled eight times, the overhead is reduced from 1/2 cycle per original iteration to 1/4.
A second limit to unrolling is the growth in code size that results. For larger loops, the code size growth may be a concern, particularly if it causes an increase in the instruction cache miss rate.
Another factor often more important than code size is the potential shortfall in registers that is created by aggressive unrolling and scheduling. This secondary effect that results from instruction scheduling in large code segments is called register pressure. It arises because scheduling code to increase ILP causes the number of live values to increase. After aggressive instruction scheduling, it may not be possible to allocate all the live values to registers. The transformed code, while theoretically faster, may lose some or all of its advantage because it generates a shortage of registers. Without unrolling, aggressive scheduling is sufficiently limited by branches so that register pressure is rarely a problem. The combination of unrolling and aggressive scheduling can, however, cause this problem. The problem becomes especially challenging in multiple-issue processors that require the exposure of more independent instruction sequences whose execution can be overlapped. In general, the use of sophisticated high-level transformations, whose potential improvements are difficult to measure before detailed code generation, has led to significant increases in the complexity of modern compilers.
Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively. This transformation is useful in a variety of processors, from simple pipelines like those we have examined so far to the multiple-issue superscalars and VLIWs explored later in this chapter.
3.3 Reducing Branch Costs with Advanced Branch Prediction
Because of the need to enforce control dependences through branch hazards and stalls, branches will hurt pipeline performance. Loop unrolling is one way to reduce the number of branch hazards; we can also reduce the performance losses of branches by predicting how they will behave. In Appendix C, we examine simple branch predictors that rely either on compile-time information or on the observed dynamic behavior of a branch in isolation. As the number of instructions in flight has increased, the importance of more accurate branch prediction has grown. In this section, we examine techniques for improving dynamic prediction accuracy.