BEQAL R9,R3,match2  ; branch-and-link to match2 if R9 == R3
BEQAL R9,R4,match3  ; branch-and-link to match3 if R9 == R4
BEQAL R9,R5,match4  ; branch-and-link to match4 if R9 == R5
BEQAL R9,R6,match5  ; branch-and-link to match5 if R9 == R6
BEQAL R9,R7,match6  ; branch-and-link to match6 if R9 == R7
BEQAL R9,R8,match7  ; branch-and-link to match7 if R9 == R8
DADDIU R16,R16,#64  ; R16 = R16 + 64
BLT R16,R17,loop    ; repeat while R16 < R17
Assume the following:
■ A barrier is used to ensure that all threads begin simultaneously.
■ The first L1 cache miss occurs after two iterations of the loop.
■ None of the BEQAL branches is taken.
■ The BLT is always taken.
■ All three processors schedule threads in a round-robin fashion.
Determine how many cycles are required for each processor to complete the first two
iterations of the loop.
3.14 [25/25/25] <3.2, 3.7> In this exercise, we look at how software techniques can extract
instruction-level parallelism (ILP) in a common vector loop. The loop below is the so-called
DAXPY loop (double-precision aX plus Y) and is the central operation in Gaussian
elimination. The following code implements the DAXPY operation, Y = aX + Y, for a vector
of length 100. Initially, R1 is set to the base address of array X and R2 is set to the base
address of Y:
DADDIU R4,R1,#800 ; R4 = upper bound for X (R1 + 800)
foo: L.D F2,0(R1) ; (F2) = X(i)
MUL.D F4,F2,F0 ; (F4) = a*X(i)
L.D F6,0(R2) ; (F6) = Y(i)
ADD.D F6,F4,F6 ; (F6) = a*X(i) + Y(i)
S.D F6,0(R2) ; Y(i) = a*X(i) + Y(i)
DADDIU R1,R1,#8 ; increment X index
DADDIU R2,R2,#8 ; increment Y index
DSLTU R3,R1,R4 ; test: continue loop?
BNEZ R3,foo ; loop if needed
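For orientation, here is a hedged C sketch of the computation the assembly above performs. The function name daxpy and the use of size_t are illustrative only; the double-precision element type and the length of 100 come from the exercise statement.

#include <stddef.h>

/* Illustrative C equivalent of the MIPS64 DAXPY loop above:
   Y(i) = a*X(i) + Y(i) for 100 double-precision elements
   (100 elements * 8 bytes = 800, matching the #800 upper bound). */
void daxpy(double a, const double *x, double *y)
{
    for (size_t i = 0; i < 100; i++)
        y[i] = a * x[i] + y[i];
}

Each iteration touches one element of X and one element of Y, which is why both R1 and R2 advance by 8 bytes per trip and the loop bound is R1 + 800.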
Assume the functional unit latencies as shown in the table below. Assume a one-cycle
delayed branch that resolves in the ID stage. Assume that results are fully bypassed.
Instruction producing result        Instruction using result    Latency in clock cycles
FP multiply                         FP ALU op                   6
FP add                              FP ALU op                   4
FP multiply                         FP store                    5
FP add                              FP store                    4
Integer operations and all loads    Any                         2
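When hand-scheduling the loop, the table is consulted for every producer/dependent-consumer pair to see how many cycles must separate them. The sketch below is a minimal illustration of that bookkeeping; it assumes (as is common in such exercises, but not stated here) that the listed latency is the number of intervening clock cycles needed to avoid stalls, and the names latency, producer, and consumer are illustrative only.

#include <stdio.h>

/* Producer and consumer classes from the latency table above. */
enum producer { FP_MUL, FP_ADD, INT_OP_OR_LOAD };
enum consumer { FP_ALU_OP, FP_STORE };

/* Latency from the table, read here as the number of intervening clock
   cycles required between a producer and a dependent consumer
   (an assumed convention, as noted in the lead-in). */
static int latency(enum producer p, enum consumer c)
{
    switch (p) {
    case FP_MUL:         return (c == FP_STORE) ? 5 : 6;
    case FP_ADD:         return 4;   /* 4 for FP ALU op and FP store alike */
    case INT_OP_OR_LOAD: return 2;   /* "Any" consumer */
    }
    return 0;
}

int main(void)
{
    /* Example: MUL.D F4,F2,F0 followed immediately by the dependent
       ADD.D F6,F4,F6 has zero intervening instructions, so under this
       reading the gap would be filled with stall cycles. */
    int intervening = 0;
    int stalls = latency(FP_MUL, FP_ALU_OP) - intervening;
    if (stalls < 0)
        stalls = 0;
    printf("MUL.D -> ADD.D stall cycles: %d\n", stalls);
    return 0;
}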
 