Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach


                                Clock cycle issued
Loop:   L.D     F0,0(R1)        1
        stall                   2
        ADD.D   F4,F0,F2        3
        stall                   4
        stall                   5
        S.D     F4,0(R1)        6
        DADDUI  R1,R1,#-8       7
        stall                   8
        BNE     R1,R2,Loop      9

We can schedule the loop to obtain only two stalls and reduce the time to seven cycles:

Loop:   L.D     F0,0(R1)
        DADDUI  R1,R1,#-8
        ADD.D   F4,F0,F2
        stall
        stall
        S.D     F4,8(R1)
        BNE     R1,R2,Loop

The stalls after ADD.D are for use by the S.D.
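
The cycle counts in the two listings above can be reproduced with a small model. The following is only a sketch: it assumes a single-issue, in-order pipeline, tracks just the one dependence per instruction that causes a stall, and uses gap values (1 cycle from the load to ADD.D, 2 from ADD.D to S.D, 1 from DADDUI to BNE) inferred from the stall counts shown, not taken from a latency table. The names instr, schedule, unscheduled_cycles, and scheduled_cycles are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One instruction in a single-issue, in-order schedule.
 * producer: index of the earlier instruction whose result is needed, or -1.
 * gap:      cycles that must lie between producer and consumer. These values
 *           are assumptions inferred from the stall counts in the listings. */
struct instr {
    const char *text;
    int producer;
    int gap;
};

/* Fills issue[i] with the cycle in which instruction i issues (first = 1)
 * and returns the issue cycle of the last instruction. */
int schedule(const struct instr *code, size_t n, int *issue)
{
    int cycle = 0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++) {
        int earliest = cycle + 1;          /* at most one issue per cycle */
        int p = code[i].producer;
        if (p >= 0) {
            assert((size_t)p < i);         /* producer must come earlier */
            int ready = issue[p] + code[i].gap + 1;
            if (ready > earliest)
                earliest = ready;          /* the difference appears as stalls */
        }
        issue[i] = earliest;
        cycle = earliest;
    }
    return cycle;
}

/* Original order: every consumer directly follows its producer. */
static const struct instr unscheduled[] = {
    {"L.D    F0,0(R1)",   -1, 0},
    {"ADD.D  F4,F0,F2",    0, 1},
    {"S.D    F4,0(R1)",    1, 2},
    {"DADDUI R1,R1,#-8",  -1, 0},
    {"BNE    R1,R2,Loop",  3, 1},
};

/* Scheduled order: DADDUI fills the slot after the load. */
static const struct instr scheduled[] = {
    {"L.D    F0,0(R1)",   -1, 0},
    {"DADDUI R1,R1,#-8",  -1, 0},
    {"ADD.D  F4,F0,F2",    0, 1},
    {"S.D    F4,8(R1)",    2, 2},
    {"BNE    R1,R2,Loop",  1, 1},
};

int unscheduled_cycles(void)
{
    int issue[5];
    return schedule(unscheduled, 5, issue);
}

int scheduled_cycles(void)
{
    int issue[5];
    return schedule(scheduled, 5, issue);
}
```

Under these assumptions, unscheduled_cycles() yields 9 and scheduled_cycles() yields 7, matching the listings. Moving DADDUI up hides the load stall and the DADDUI-to-BNE stall; only the two cycles between ADD.D and S.D remain.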

In the previous example, we complete one loop iteration and store back one array element every seven clock cycles, but the actual work of operating on the array element takes just three (the load, add, and store) of those seven clock cycles. The remaining four clock cycles consist of loop overhead (the DADDUI and BNE) and two stalls. To eliminate these four clock cycles, we need to get more operations relative to the number of overhead instructions.
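
The accounting above can be written out as a short calculation. This is a sketch only: the function name cycles_per_element and its parameters are hypothetical, and the second use below assumes, as a best case, that a loop body holding four copies can be scheduled with no stalls at all.

```c
#include <assert.h>

/* Cycles spent per array element when the loop body holds `copies` copies
 * of the load/add/store work, shares one DADDUI and one BNE, and incurs
 * `stalls` stall cycles per trip through the loop. */
double cycles_per_element(int copies, int stalls)
{
    assert(copies > 0 && stalls >= 0);
    int work = 3 * copies;      /* load, add, store for each element */
    int overhead = 2;           /* DADDUI and BNE, once per trip */
    return (double)(work + overhead + stalls) / copies;
}
```

With one copy and two stalls, cycles_per_element(1, 2) gives 7.0, the figure from the scheduled loop. With four copies and the assumed zero stalls, cycles_per_element(4, 0) gives 3.5, which shows how spreading the two overhead instructions over more elements moves the cost toward the three cycles of real work.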

A simple scheme for increasing the number of instructions relative to the branch and overhead instructions is loop unrolling. Unrolling simply replicates the loop body multiple times, adjusting the loop termination code.
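
At the source level, the transformation might look like the following sketch. It assumes the assembly loop corresponds to adding a scalar s to each element of an array of doubles while walking downward, and that the element count is a multiple of four so no cleanup loop is needed. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled form: one element per trip, from the top of the array down. */
void add_scalar(double *x, size_t n, double s)
{
    for (size_t i = n; i > 0; i--)
        x[i - 1] += s;
}

/* Unrolled by four: the body is replicated four times, and the loop
 * termination code is adjusted so the index steps by 4 per trip.
 * Assumes n is a multiple of 4. */
void add_scalar_unrolled4(double *x, size_t n, double s)
{
    assert(n % 4 == 0);
    for (size_t i = n; i > 0; i -= 4) {
        x[i - 1] += s;
        x[i - 2] += s;
        x[i - 3] += s;
        x[i - 4] += s;
    }
}
```

Both functions leave the array in the same state; the unrolled one executes one index update and one branch for every four elements instead of one for each.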

Loop unrolling can also be used to improve scheduling. Because it eliminates the branch, it allows instructions from different iterations to be scheduled together. In this case, we can eliminate the data use stalls by creating additional independent instructions within the loop body. If we simply replicated the instructions when we unrolled the loop, the resulting use of the same registers could prevent us from effectively scheduling the loop. Thus, we will want to use different registers for each iteration, increasing the required number of registers.
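
The idea of giving each copy its own registers can be illustrated at the source level. This is a sketch under the same assumptions as the loop in the text (an array of doubles, a scalar added to each element, a count that is a multiple of four); the function name and the temporaries t0 to t3 are hypothetical stand-ins for distinct registers, and a real C compiler remains free to allocate and reorder as it sees fit.

```c
#include <assert.h>
#include <stddef.h>

/* Four copies of the body, each with its own temporary, so no copy has to
 * wait for a value that another copy is still using. The loads, adds, and
 * stores are grouped so that independent operations sit between each
 * producer and its consumer. Assumes n is a multiple of 4. */
void add_scalar_renamed4(double *x, size_t n, double s)
{
    assert(n % 4 == 0);
    for (size_t i = n; i > 0; i -= 4) {
        /* four independent loads */
        double t0 = x[i - 1];
        double t1 = x[i - 2];
        double t2 = x[i - 3];
        double t3 = x[i - 4];

        /* four independent adds */
        t0 += s;
        t1 += s;
        t2 += s;
        t3 += s;

        /* four independent stores */
        x[i - 1] = t0;
        x[i - 2] = t1;
        x[i - 3] = t2;
        x[i - 4] = t3;
    }
}
```

Had all four copies shared a single temporary, each load would overwrite the value the previous add and store still needed, which would force the copies to run one after another. The cost is that four temporaries are live at once instead of one.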

Example

Show our loop unrolled so that there are four copies of the loop body, assuming R1 − R2 (that is, the size of the array) is initially a multiple of 32, which means that the number of loop iterations is a multiple of 4. Eliminate any obviously redundant computations and do not reuse any of the registers.
