Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach

begins in Pipe 0, and N + 1 happens to require a shorter execution latency than N, then N + 1 will complete before N (even though program ordering would have implied otherwise). Recite at least two reasons why that could be hazardous and will require special considerations in the microarchitecture. Give an example of two instructions from the code in Figure 3.48 that demonstrate this hazard.

3.5 [20] <3.7> Reorder the instructions to improve performance of the code in Figure 3.48. Assume the two-pipe machine in Exercise 3.3 and that the out-of-order completion issues of Exercise 3.4 have been dealt with successfully. Just worry about observing true data dependences and functional unit latencies for now. How many cycles does your reordered code take?

3.6 [10/10/10] <3.1, 3.2> Every cycle that does not initiate a new operation in a pipe is a lost opportunity, in the sense that your hardware is not living up to its potential.

a. [10] <3.1, 3.2> In your reordered code from Exercise 3.5, what fraction of all cycles, counting both pipes, were wasted (did not initiate a new op)?

b. [10] <3.1, 3.2> Loop unrolling is one standard compiler technique for finding more parallelism in code, in order to minimize the lost opportunities for performance. Hand-unroll two iterations of the loop in your reordered code from Exercise 3.5.

c. [10] <3.1, 3.2> What speedup did you obtain? (For this exercise, just color the N + 1 iteration's instructions green to distinguish them from the Nth iteration's instructions; if you were actually unrolling the loop, you would have to reassign registers to prevent collisions between the iterations.)
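The bookkeeping behind parts (a) and (c) can be sketched as follows. This is a minimal illustration, not a solution: the function names and the sample numbers (10 operations, 16 cycles, 20 cycles for the unrolled loop) are hypothetical and are not taken from Figure 3.48. It assumes each pipe offers one issue slot per cycle, and that speedup is measured in cycles per original loop iteration.

```python
def wasted_fraction(ops_issued, cycles, pipes=2):
    """Fraction of issue slots (cycles x pipes) that started no new op."""
    slots = cycles * pipes
    return (slots - ops_issued) / slots


def speedup(cycles_per_iter_before, cycles_per_iter_after):
    """Ratio of old to new execution time, per original loop iteration."""
    return cycles_per_iter_before / cycles_per_iter_after


# Hypothetical numbers, for illustration only.
# 10 ops issued over 16 cycles on two pipes: 32 slots, 22 of them idle.
wasted = wasted_fraction(ops_issued=10, cycles=16)

# If two unrolled iterations finished in 20 cycles, that is 10 cycles each.
gain = speedup(cycles_per_iter_before=16, cycles_per_iter_after=20 / 2)

print(f"wasted fraction = {wasted:.4f}, speedup = {gain:.2f}")
```

Note that the unrolled loop's cycle count is divided by the number of iterations it covers before comparing it with the single-iteration count.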

3.7 [15] <2.1> Computers spend most of their time in loops, so multiple loop iterations are great places to speculatively find more work to keep CPU resources busy. Nothing is ever easy, though; the compiler emitted only one copy of that loop's code, so even though multiple iterations are handling distinct data, they will appear to use the same registers. To keep multiple iterations' register usages from colliding, we rename their registers. Figure 3.49 shows example code that we would like our hardware to rename. A compiler could have simply unrolled the loop and used different registers to avoid conflicts, but if we expect our hardware to unroll the loop, it must also do the register renaming. How? Assume your hardware has a pool of temporary registers (call them T registers, and assume that there are 64 of them, T0 through T63) that it can substitute for those registers designated by the compiler. This rename hardware is indexed by the src (source) register designation, and the value in the table is the T register of the last destination that targeted that register. (Think of these table values as producers, and the src registers are the consumers; it doesn't much matter where the producer puts its result as long as its consumers can find it.) Consider the code sequence in Figure 3.49. Every time you see a destination register in the code, substitute the next available T, beginning with T9. Then update all the src registers accordingly, so that true data dependences are maintained. Show the resulting code. (Hint: See Figure 3.50.)
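The rename-table mechanism described in Exercise 3.7 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the hardware's actual design: the `rename` function, the tuple encoding of instructions, and the sample `program` are all hypothetical, and the sample is deliberately not the code of Figure 3.49. Sources are looked up before the destination is renamed, so an instruction that reads and writes the same register still sees the previous producer. A source with no table entry keeps its compiler-assigned name.

```python
from typing import Dict, List, Optional, Tuple

# (opcode, destination register or None, list of source registers)
Instr = Tuple[str, Optional[str], List[str]]


def rename(instrs: List[Instr], first_t: int = 9, num_t: int = 64) -> List[Instr]:
    """Rename destinations to T registers and redirect sources to match."""
    table: Dict[str, str] = {}  # architectural register -> latest T register
    next_t = first_t
    renamed: List[Instr] = []
    for op, dest, srcs in instrs:
        # Consumers first: read the table before this instruction updates it.
        new_srcs = [table.get(s, s) for s in srcs]
        new_dest = None
        if dest is not None:
            if next_t >= num_t:
                # Simplification: a real design would recycle freed T registers.
                raise RuntimeError("out of T registers")
            new_dest = f"T{next_t}"
            next_t += 1
            table[dest] = new_dest  # this T is now the producer for dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Hypothetical instruction sequence, for illustration only.
program: List[Instr] = [
    ("LD",    "F2", ["R1"]),
    ("ADDD",  "F2", ["F2", "F0"]),
    ("MULTD", "F6", ["F2", "F2"]),
    ("LD",    "F2", ["R2"]),        # reuses F2; renaming removes the conflict
    ("SD",    None, ["F6", "R1"]),  # stores have no destination register
]

renamed = rename(program)
for op, dest, srcs in renamed:
    print(op, dest, srcs)
```

In this sample the second load of `F2` receives a fresh T register, so it no longer conflicts with the earlier `ADDD` and `MULTD`, while the store still reads the T register that `MULTD` produced.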
