quires analyzing the memory addresses and finding that they do not refer to the same address.
■ Schedule the code, preserving any dependences needed to yield the same result as the original code.
The key requirement underlying all of these transformations is an understanding of how one
instruction depends on another and how the instructions can be changed or reordered given
the dependences.
Three different effects limit the gains from loop unrolling: (1) a decrease in the amount of
overhead amortized with each unroll, (2) code size limitations, and (3) compiler limitations.
Let's consider the question of loop overhead first. When we unrolled the loop four times, it generated sufficient parallelism among the instructions that the loop could be scheduled with no stall cycles. In fact, in 14 clock cycles, only 2 cycles were loop overhead: the DADDUI, which maintains the index value, and the BNE, which terminates the loop. If the loop is unrolled eight times, the overhead is reduced from 1/2 cycle per original iteration to 1/4.
A second limit to unrolling is the growth in code size that results. For larger loops, the code size growth may be a concern, particularly if it causes an increase in the instruction cache miss rate.
Another factor often more important than code size is the potential shortfall in registers that is created by aggressive unrolling and scheduling. This secondary effect that results from instruction scheduling in large code segments is called register pressure. It arises because scheduling code to increase ILP causes the number of live values to increase. After aggressive instruction scheduling, it may not be possible to allocate all the live values to registers. The transformed code, while theoretically faster, may lose some or all of its advantage because it generates a shortage of registers. Without unrolling, aggressive scheduling is sufficiently limited by branches so that register pressure is rarely a problem. The combination of unrolling and aggressive scheduling can, however, cause this problem. The problem becomes especially challenging in multiple-issue processors that require the exposure of more independent instruction sequences whose execution can be overlapped. In general, the use of sophisticated high-level transformations, whose potential improvements are difficult to measure before detailed code generation, has led to significant increases in the complexity of modern compilers.
Loop unrolling is a simple but useful method for increasing the size of straight-line code fragments that can be scheduled effectively. This transformation is useful in a variety of processors, from simple pipelines like those we have examined so far to the multiple-issue superscalars and VLIWs explored later in this chapter.
3.3 Reducing Branch Costs with Advanced Branch Prediction
Because of the need to enforce control dependences through branch hazards and stalls, branches will hurt pipeline performance. Loop unrolling is one way to reduce the number of branch hazards; we can also reduce the performance losses of branches by predicting how they will behave. In Appendix C, we examine simple branch predictors that rely either on compile-time information or on the observed dynamic behavior of a branch in isolation. As the number of instructions in flight has increased, the importance of more accurate branch prediction has grown. In this section, we examine techniques for improving dynamic prediction accuracy.