BEQAL R9,R3,match2
BEQAL R9,R4,match3
BEQAL R9,R5,match4
BEQAL R9,R6,match5
BEQAL R9,R7,match6
BEQAL R9,R8,match7
DADDIU R16,R16,#64
BLT R16,R17,loop
Assume the following:
■ A barrier is used to ensure that all threads begin simultaneously.
■ The first L1 cache miss occurs after two iterations of the loop.
■ None of the BEQAL branches is taken.
■ The BLT is always taken.
■ All three processors schedule threads in a round-robin fashion.
Determine how many cycles are required for each processor to complete the first two
iterations of the loop.
3.14 [25/25/25] <3.2, 3.7> In this exercise, we look at how software techniques can extract
instruction-level parallelism (ILP) in a common vector loop. The following loop is the
so-called DAXPY loop (double-precision aX plus Y) and is the central operation in Gaussian
elimination. The following code implements the DAXPY operation, Y = aX + Y, for a vector
of length 100. Initially, R1 is set to the base address of array X and R2 is set to the base
address of Y:
      DADDIU  R4,R1,#800   ; R4 = upper bound for X
foo:  L.D     F2,0(R1)     ; (F2) = X(i)
      MUL.D   F4,F2,F0     ; (F4) = a*X(i)
      L.D     F6,0(R2)     ; (F6) = Y(i)
      ADD.D   F6,F4,F6     ; (F6) = a*X(i) + Y(i)
      S.D     F6,0(R2)     ; Y(i) = a*X(i) + Y(i)
      DADDIU  R1,R1,#8     ; increment X index
      DADDIU  R2,R2,#8     ; increment Y index
      DSLTU   R3,R1,R4     ; test: continue loop?
      BNEZ    R3,foo       ; loop if needed
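For orientation, the computation performed by the assembly loop can be sketched in a high-level language. The Python below is only an illustrative reference, not part of the exercise: the function name daxpy and the list-based arrays are assumptions, and the scalar a is taken to be the value held in F0, as the MUL.D comment suggests. The offset #800 in the first DADDIU corresponds to 100 elements of 8 bytes each.

```python
def daxpy(a, x, y):
    """Update y in place so that y[i] = a * x[i] + y[i] for every index i.

    Hypothetical helper: `a` plays the role of F0, `x` and `y` the arrays
    addressed through R1 and R2 in the assembly loop.
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    for i in range(len(x)):
        # One pass of this body matches one iteration of the `foo` loop:
        # load X(i), multiply by a, load Y(i), add, store back to Y(i).
        y[i] = a * x[i] + y[i]
    return y


# Vector length 100, as in the exercise (100 doubles * 8 bytes = 800 bytes).
x = [float(i) for i in range(100)]
y = [1.0] * 100
daxpy(2.0, x, y)
```

Each iteration touches only element i of X and Y, so there is no dependence carried from one iteration to the next through the array data; only the index updates link successive iterations.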
Assume the functional unit latencies as shown in the table below. Assume a one-cycle
delayed branch that resolves in the ID stage. Assume that results are fully bypassed.
Instruction producing result        Instruction using result    Latency in clock cycles
FP multiply
FP add
FP multiply                         FP store
FP add                              FP store
Integer operations and all loads
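A latency table of this kind can be applied mechanically to count stall cycles in an unscheduled loop. The Python sketch below shows one way to do so under simplifying assumptions: single issue, in order, with a latency of n meaning that n cycles must separate producer and consumer. It ignores the branch delay slot and any structural hazards. The names issue_cycles, LOOP, and placeholder_latency are hypothetical, and the latency values in placeholder_latency are placeholders chosen for illustration, not the values of the exercise's table.

```python
# Each entry: (instruction text, op class, destination register, source registers).
LOOP = [
    ("L.D F2,0(R1)",    "load",   "F2", ["R1"]),
    ("MUL.D F4,F2,F0",  "fpmul",  "F4", ["F2", "F0"]),
    ("L.D F6,0(R2)",    "load",   "F6", ["R2"]),
    ("ADD.D F6,F4,F6",  "fpadd",  "F6", ["F4", "F6"]),
    ("S.D F6,0(R2)",    "store",  None, ["F6", "R2"]),
    ("DADDIU R1,R1,#8", "int",    "R1", ["R1"]),
    ("DADDIU R2,R2,#8", "int",    "R2", ["R2"]),
    ("DSLTU R3,R1,R4",  "int",    "R3", ["R1", "R4"]),
    ("BNEZ R3,foo",     "branch", None, ["R3"]),
]


def placeholder_latency(producer, consumer):
    """Illustrative latencies only; substitute the exercise's table here."""
    if producer == "fpmul":
        return 3
    if producer == "fpadd":
        return 2
    return 1  # integer operations and loads


def issue_cycles(program, latency):
    """Return the issue cycle of each instruction (first instruction = cycle 1)."""
    last_writer = {}  # register -> (issue cycle, op class) of its latest producer
    cycles = []
    previous = 0
    for _text, op, dest, sources in program:
        cycle = previous + 1  # in-order, at most one instruction per cycle
        for reg in sources:
            if reg in last_writer:
                produced_at, producer = last_writer[reg]
                # `latency` stall cycles must separate producer and consumer.
                cycle = max(cycle, produced_at + latency(producer, op) + 1)
        if dest is not None:
            last_writer[dest] = (cycle, op)
        cycles.append(cycle)
        previous = cycle
    return cycles


cycles = issue_cycles(LOOP, placeholder_latency)
stalls = cycles[-1] - len(LOOP)
print(cycles, "stall cycles:", stalls)
```

Only dependences within a single iteration are tracked here; extending the sketch to an unrolled or rescheduled loop would mean passing a longer instruction list in the new order.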