                                  Clock cycle issued
Loop:   L.D      F0,0(R1)         1
        stall                     2
        ADD.D    F4,F0,F2         3
        stall                     4
        stall                     5
        S.D      F4,0(R1)         6
        DADDUI   R1,R1,#-8        7
        stall                     8
        BNE      R1,R2,Loop       9
We can schedule the loop to obtain only two stalls and reduce the time to seven cycles:
Loop:   L.D      F0,0(R1)
        DADDUI   R1,R1,#-8
        ADD.D    F4,F0,F2
        stall
        stall
        S.D      F4,8(R1)
        BNE      R1,R2,Loop
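The two stalls after ADD.D are there because the S.D uses its result, F4. The S.D offset
changes from 0 to 8 because DADDUI now decrements R1 before the store executes.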
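The cycle counts in the two schedules can be checked with a small model of an in-order, single-issue pipeline. The following is only a sketch: the latency table (one stall between a load and a dependent FP add, two between an FP add and a dependent store, one between an integer op and a dependent branch) is an assumption inferred from the stalls in the listings, and the names `LATENCY` and `issue_cycles` are hypothetical.

```python
# Stall cycles needed between a producer and a dependent consumer.
# ASSUMED values, inferred from the stalls in the two listings.
LATENCY = {
    ("load", "fp"): 1,      # L.D    -> ADD.D
    ("fp", "store"): 2,     # ADD.D  -> S.D
    ("int", "branch"): 1,   # DADDUI -> BNE
}

def issue_cycles(program):
    """Return the issue cycle of each (kind, dest, sources) instruction."""
    last_writer = {}  # register -> (issue cycle, kind) of its latest writer
    cycles = []
    prev = 0
    for kind, dest, sources in program:
        earliest = prev + 1  # in-order, one instruction per cycle
        for reg in sources:
            if reg in last_writer:
                when, producer = last_writer[reg]
                wait = LATENCY.get((producer, kind), 0)
                earliest = max(earliest, when + 1 + wait)
        if dest is not None:
            last_writer[dest] = (earliest, kind)
        cycles.append(earliest)
        prev = earliest
    return cycles

unscheduled = [
    ("load",   "F0", ["R1"]),        # L.D    F0,0(R1)
    ("fp",     "F4", ["F0", "F2"]),  # ADD.D  F4,F0,F2
    ("store",  None, ["F4", "R1"]),  # S.D    F4,0(R1)
    ("int",    "R1", ["R1"]),        # DADDUI R1,R1,#-8
    ("branch", None, ["R1", "R2"]),  # BNE    R1,R2,Loop
]

scheduled = [
    ("load",   "F0", ["R1"]),        # L.D    F0,0(R1)
    ("int",    "R1", ["R1"]),        # DADDUI R1,R1,#-8
    ("fp",     "F4", ["F0", "F2"]),  # ADD.D  F4,F0,F2
    ("store",  None, ["F4", "R1"]),  # S.D    F4,8(R1)
    ("branch", None, ["R1", "R2"]),  # BNE    R1,R2,Loop
]

print(issue_cycles(unscheduled))  # last entry is the loop length in cycles
print(issue_cycles(scheduled))
```

Under these assumed latencies the model reproduces the issue cycles of both listings: moving DADDUI into the slot after the load hides the load stall and separates DADDUI from BNE, leaving only the two stalls before S.D.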
In the previous example, we complete one loop iteration and store back one array element
every seven clock cycles, but the actual work of operating on the array element takes just three
(the load, add, and store) of those seven clock cycles. The remaining four clock cycles consist
of loop overhead (the DADDUI and BNE) and two stalls. To eliminate these four clock cycles we
need to get more operations relative to the number of overhead instructions.
A simple scheme for increasing the number of instructions relative to the branch and overhead
instructions is loop unrolling. Unrolling simply replicates the loop body multiple times,
adjusting the loop termination code.
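At source level, unrolling might look like the sketch below. It assumes the assembly loop corresponds to adding a scalar to every array element while walking from the highest index down to zero; the function names are hypothetical, and Python is used only to show the shape of the transformation, not its performance effect.

```python
def add_scalar_rolled(x, s):
    """Original form: one element per iteration, highest index first."""
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        i -= 1  # loop overhead: one index update and one test per element
    return x

def add_scalar_unrolled4(x, s):
    """Body replicated four times; overhead is paid once per four elements."""
    if len(x) % 4 != 0:
        # Assumed precondition: the trip count is a multiple of 4.
        raise ValueError("length must be a multiple of 4")
    i = len(x) - 1
    while i >= 0:
        x[i] = x[i] + s
        x[i - 1] = x[i - 1] + s
        x[i - 2] = x[i - 2] + s
        x[i - 3] = x[i - 3] + s
        i -= 4  # termination code adjusted: step by 4 instead of 1
    return x

print(add_scalar_unrolled4([1.0] * 8, 2.0))
```

The four statements in the unrolled body are independent of one another, which is what gives a scheduler room to interleave them.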
Loop unrolling can also be used to improve scheduling. Because it eliminates the branch,
it allows instructions from different iterations to be scheduled together. In this case, we can
eliminate the data use stalls by creating additional independent instructions within the loop
body. If we simply replicated the instructions when we unrolled the loop, the resulting use of
the same registers could prevent us from effectively scheduling the loop. Thus, we will want
to use different registers for each iteration, increasing the required number of registers.
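A rough way to see the potential payoff is to count cycles per array element as the number of copies of the body grows. This sketch assumes three work instructions per copy (load, add, store) and two overhead instructions per loop iteration (DADDUI and BNE), as in the text. The `stalls` parameter is left explicit because whether scheduling can remove every stall is an assumption here, not a result shown above; the function name is hypothetical.

```python
def cycles_per_element(copies, work_per_copy=3, overhead=2, stalls=0):
    """Cycles per array element for one iteration holding `copies` bodies."""
    total = copies * work_per_copy + overhead + stalls
    return total / copies

# The scheduled loop from the text: one copy, two remaining stalls.
print(cycles_per_element(1, stalls=2))

# Four copies, ASSUMING all stalls can be scheduled away.
print(cycles_per_element(4))
```

With one copy and two stalls this gives the seven cycles per element quoted in the text. As the number of copies grows with no stalls, the figure approaches the three cycles of actual work.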
Example
Show our loop unrolled so that there are four copies of the loop body, assuming
R1 − R2 (that is, the size of the array) is initially a multiple of 32, which means
that the number of loop iterations is a multiple of 4. Eliminate any obviously redundant
computations and do not reuse any of the registers.