Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach

Hardware Reference

In-Depth Information

per cycle (one op to each ALU plus one memory op to the LD / ST ).

a. [15] <3.4> Suppose all of the instructions from the sequence in Figure 3.48 are present

in the RS, with no renaming having been done. Highlight any instructions in the code

where register renaming would improve performance. ( Hint : Look for read-after-write

and write-after-write hazards. Assume the same functional unit latencies as in Figure

3.48 .)

b. [20] <3.4> Suppose the register-renamed version of the code from part (a) is resident in

the RS in clock cycle N , with latencies as given in Figure 3.48 . Show how the RS should

dispatch these instructions out of order, clock by clock, to obtain optimal performance

on this code. (Assume the same RS restrictions as in part (a). Also assume that res-

ults must be writen into the RS before they're available for use—no bypassing.) How

many clock cycles does the code sequence take?

c. [20] <3.4> Part (b) lets the RS try to optimally schedule these instructions. But in reality,

the whole instruction sequence of interest is not usually present in the RS. Instead,

various events clear the RS, and as a new code sequence streams in from the decoder,

the RS must choose to dispatch what it has. Suppose that the RS is empty. In cycle 0,

the first two register-renamed instructions of this sequence appear in the RS. Assume

it takes one clock cycle to dispatch any op, and assume functional unit latencies are

as they were for Exercise 3.2 . Further assume that the front end (decoder/register-re-

namer) will continue to supply two new instructions per clock cycle. Show the cycle-

by-cycle order of dispatch of the RS. How many clock cycles does this code sequence

require now?

d. [10] <3.14> If you wanted to improve the results of part (c), which would have helped

most: (1) Another ALU? (2) Another LD/ST unit? (3) Full bypassing of ALU results to

subsequent operations? or (4) Cuting the longest latency in half? What's the speedup?

e. [20] <3.7> Now let's consider speculation, the act of fetching, decoding, and executing

beyond one or more conditional branches. Our motivation to do this is twofold: The

dispatch schedule we came up with in part (c) had lots of nops, and we know com-

puters spend most of their time executing loops (which implies the branch back to the

top of the loop is prety predictable). Loops tell us where to ind more work to do;

our sparse dispatch schedule suggests we have opportunities to do some of that work

earlier than before. In part (d) you found the critical path through the loop. Imagine

folding a second copy of that path onto the schedule you got in part (b). How many

more clock cycles would be required to do two loops' worth of work (assuming all in-

structions are resident in the RS)? (Assume all functional units are fully pipelined.)

Search WWH ::

Custom Search

Home