BEQAL R9,R3,match2  ; branch-and-link to match2 if R9 == R3
BEQAL R9,R4,match3  ; branch-and-link to match3 if R9 == R4
BEQAL R9,R5,match4  ; branch-and-link to match4 if R9 == R5
BEQAL R9,R6,match5  ; branch-and-link to match5 if R9 == R6
BEQAL R9,R7,match6  ; branch-and-link to match6 if R9 == R7
BEQAL R9,R8,match7  ; branch-and-link to match7 if R9 == R8
DADDIU R16,R16,#64  ; R16 = R16 + 64
BLT R16,R17,loop    ; repeat while R16 < R17
Assume the following:
■ A barrier is used to ensure that all threads begin simultaneously.
■ The first L1 cache miss occurs after two iterations of the loop.
■ None of the BEQAL branches is taken.
■ The BLT is always taken.
■ All three processors schedule threads in a round-robin fashion.
Determine how many cycles are required for each processor to complete the first two
iterations of the loop.
3.14 [25/25/25] <3.2, 3.7> In this exercise, we look at how software techniques can extract
instruction-level parallelism (ILP) in a common vector loop. The loop below is the so-called
DAXPY loop (double-precision aX plus Y) and is the central operation in Gaussian
elimination. The following code implements the DAXPY operation, Y = aX + Y, for a vector
of length 100. Initially, R1 is set to the base address of array X and R2 is set to the base
address of Y:
DADDIU R4,R1,#800 ; R4 = upper bound for X (R1 + 800)
foo: L.D F2,0(R1) ; (F2) = X(i)
MUL.D F4,F2,F0 ; (F4) = a*X(i)
L.D F6,0(R2) ; (F6) = Y(i)
ADD.D F6,F4,F6 ; (F6) = a*X(i) + Y(i)
S.D F6,0(R2) ; Y(i) = a*X(i) + Y(i)
DADDIU R1,R1,#8 ; increment X index
DADDIU R2,R2,#8 ; increment Y index
DSLTU R3,R1,R4 ; test: continue loop?
BNEZ R3,foo ; loop if needed
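For orientation, here is a hedged C sketch of the computation the assembly above performs. The function name daxpy and the use of size_t are illustrative only; the double-precision element type and the length of 100 come from the exercise statement.

#include <stddef.h>

/* Illustrative C equivalent of the MIPS64 DAXPY loop above:
   Y(i) = a*X(i) + Y(i) for 100 double-precision elements
   (100 elements * 8 bytes = 800, matching the #800 upper bound). */
void daxpy(double a, const double *x, double *y)
{
    for (size_t i = 0; i < 100; i++)
        y[i] = a * x[i] + y[i];
}

Each iteration touches one element of X and one element of Y, which is why both R1 and R2 advance by 8 bytes per trip and the loop bound is R1 + 800.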
Assume the functional unit latencies as shown in the table below. Assume a one-cycle
delayed branch that resolves in the ID stage. Assume that results are fully bypassed.
Instruction producing result        Instruction using result    Latency in clock cycles
FP multiply                         FP ALU op                   6
FP add                              FP ALU op                   4
FP multiply                         FP store                    5
FP add                              FP store                    4
Integer operations and all loads    Any                         2
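When hand-scheduling the loop, the table is consulted for every producer/dependent-consumer pair to see how many cycles must separate them. The sketch below is a minimal illustration of that bookkeeping; it assumes (as is common in such exercises, but not stated here) that the listed latency is the number of intervening clock cycles needed to avoid stalls, and the names latency, producer, and consumer are illustrative only.

#include <stdio.h>

/* Producer and consumer classes from the latency table above. */
enum producer { FP_MUL, FP_ADD, INT_OP_OR_LOAD };
enum consumer { FP_ALU_OP, FP_STORE };

/* Latency from the table, read here as the number of intervening clock
   cycles required between a producer and a dependent consumer
   (an assumed convention, as noted in the lead-in). */
static int latency(enum producer p, enum consumer c)
{
    switch (p) {
    case FP_MUL:         return (c == FP_STORE) ? 5 : 6;
    case FP_ADD:         return 4;   /* 4 for FP ALU op and FP store alike */
    case INT_OP_OR_LOAD: return 2;   /* "Any" consumer */
    }
    return 0;
}

int main(void)
{
    /* Example: MUL.D F4,F2,F0 followed immediately by the dependent
       ADD.D F6,F4,F6 has zero intervening instructions, so under this
       reading the gap would be filled with stall cycles. */
    int intervening = 0;
    int stalls = latency(FP_MUL, FP_ALU_OP) - intervening;
    if (stalls < 0)
        stalls = 0;
    printf("MUL.D -> ADD.D stall cycles: %d\n", stalls);
    return 0;
}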
 