Example
Show the unrolled loop in the previous example after it has been scheduled for
the pipeline with the latencies from Figure 3.2.
Answer
Loop:   L.D     F0,0(R1)
        L.D     F6,-8(R1)
        L.D     F10,-16(R1)
        L.D     F14,-24(R1)
        ADD.D   F4,F0,F2
        ADD.D   F8,F6,F2
        ADD.D   F12,F10,F2
        ADD.D   F16,F14,F2
        S.D     F4,0(R1)
        S.D     F8,-8(R1)
        DADDUI  R1,R1,#-32
        S.D     F12,16(R1)
        S.D     F16,8(R1)
        BNE     R1,R2,Loop
The execution time of the unrolled loop has dropped to a total of 14 clock
cycles, or 3.5 clock cycles per element, compared with 9 cycles per element
before any unrolling or scheduling and 7 cycles when scheduled but not unrolled.
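The cycle counts quoted here can be checked with a small in-order issue model. The sketch below is only an illustration: the latency values are assumed to match Figure 3.2 (which is not reproduced in this section), the one-cycle gap between an integer update and a dependent branch is likewise an assumption, and the names `LATENCY`, `total_cycles`, and the three program lists are hypothetical. The model tracks only true (read-after-write) dependences and ignores branch delay slots and structural hazards.

```python
# Intervening stall cycles between a producer and a dependent consumer.
# ASSUMED to match Figure 3.2; pairs that are not listed cost 0 stalls.
LATENCY = {
    ("load", "fpalu"): 1,
    ("fpalu", "fpalu"): 3,
    ("fpalu", "store"): 2,
    ("load", "store"): 0,
    ("intalu", "branch"): 1,  # assumption: DADDUI -> BNE needs one gap cycle
}


def total_cycles(program):
    """Issue one instruction per cycle, in order, stalling on RAW dependences.

    program: list of (kind, dest_or_None, [source registers]).
    Returns the cycle in which the last instruction issues.
    """
    last_writer = {}  # register -> (issue cycle, kind) of its latest producer
    cycle = 0
    for kind, dest, sources in program:
        earliest = cycle + 1
        for reg in sources:
            if reg in last_writer:
                produced_at, producer_kind = last_writer[reg]
                gap = LATENCY.get((producer_kind, kind), 0)
                earliest = max(earliest, produced_at + gap + 1)
        cycle = earliest
        if dest is not None:
            last_writer[dest] = (cycle, kind)
    return cycle


# Original loop body: L.D, ADD.D, S.D, DADDUI, BNE (one element per iteration).
ORIGINAL = [
    ("load", "F0", ["R1"]),
    ("fpalu", "F4", ["F0", "F2"]),
    ("store", None, ["F4", "R1"]),
    ("intalu", "R1", ["R1"]),
    ("branch", None, ["R1", "R2"]),
]

# Scheduled but not unrolled: DADDUI is hoisted above ADD.D.
SCHEDULED = [
    ("load", "F0", ["R1"]),
    ("intalu", "R1", ["R1"]),
    ("fpalu", "F4", ["F0", "F2"]),
    ("store", None, ["F4", "R1"]),
    ("branch", None, ["R1", "R2"]),
]

# The unrolled and scheduled loop from the listing (four elements per iteration).
UNROLLED_SCHEDULED = (
    [("load", f, ["R1"]) for f in ("F0", "F6", "F10", "F14")]
    + [("fpalu", d, [s, "F2"])
       for d, s in (("F4", "F0"), ("F8", "F6"), ("F12", "F10"), ("F16", "F14"))]
    + [("store", None, ["F4", "R1"]), ("store", None, ["F8", "R1"])]
    + [("intalu", "R1", ["R1"])]
    + [("store", None, ["F12", "R1"]), ("store", None, ["F16", "R1"])]
    + [("branch", None, ["R1", "R2"])]
)

if __name__ == "__main__":
    print(total_cycles(ORIGINAL))                # 9 cycles for 1 element
    print(total_cycles(SCHEDULED))               # 7 cycles for 1 element
    print(total_cycles(UNROLLED_SCHEDULED))      # 14 cycles for 4 elements
    print(total_cycles(UNROLLED_SCHEDULED) / 4)  # 3.5 cycles per element
```

Under these assumptions the unrolled schedule issues 14 instructions in 14 cycles, which is another way of saying that no instruction waits on a producer.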
The gain from scheduling on the unrolled loop is even larger than on the original loop. This
increase arises because unrolling the loop exposes more computation that can be scheduled to
minimize the stalls; the code above has no stalls. Scheduling the loop in this fashion
necessitates realizing that the loads and stores are independent and can be interchanged.
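One detail of the listing is worth spelling out: the last two stores sit below the DADDUI that subtracts 32 from R1, so their offsets read 16 and 8 rather than -16 and -24. The following sketch checks that arithmetic; the function name and the sample pointer value are hypothetical and chosen only for illustration.

```python
def store_offset_after_decrement(original_offset, decrement):
    """Offset a store must use once it is moved below the pointer update.

    Before the move the store targets R1 + original_offset. After the
    pointer has been decremented the register holds R1 - decrement, so the
    offset has to grow by `decrement` to reach the same address.
    """
    return original_offset + decrement


R1 = 1000  # arbitrary illustrative value for the array pointer
for original_offset in (-16, -24):
    new_offset = store_offset_after_decrement(original_offset, 32)
    # Same effective address before and after the store is moved.
    assert (R1 - 32) + new_offset == R1 + original_offset
    print(original_offset, "->", new_offset)  # -16 -> 16, then -24 -> 8
```

The two stores that stay above the DADDUI keep their original offsets of 0 and -8.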
Summary of the Loop Unrolling and Scheduling
Throughout this chapter and Appendix H, we will look at a variety of hardware and software
techniques that allow us to take advantage of instruction-level parallelism to fully utilize the
potential of the functional units in a processor. The key to most of these techniques is to know
when and how the ordering among instructions may be changed. In our example we made
many such changes, which to us, as human beings, were obviously allowable. In practice, this
process must be performed in a methodical fashion either by a compiler or by hardware. To
obtain the final unrolled code we had to make the following decisions and transformations:
■ Determine that unrolling the loop would be useful by finding that the loop iterations were
independent, except for the loop maintenance code.
■ Use different registers to avoid unnecessary constraints that would be forced by using the
same registers for different computations (e.g., name dependences).
■ Eliminate the extra test and branch instructions and adjust the loop termination and
iteration code.
■ Determine that the loads and stores in the unrolled loop can be interchanged by observing
that the loads and stores from different iterations are independent. This transformation re-