Hardware Reference
In-Depth Information
per cycle (one op to each ALU plus one memory op to the LD / ST ).
a. [15] <3.4> Suppose all of the instructions from the sequence in Figure 3.48 are present
in the RS, with no renaming having been done. Highlight any instructions in the code
where register renaming would improve performance. ( Hint : Look for read-after-write
and write-after-write hazards. Assume the same functional unit latencies as in Figure
3.48 .)
b. [20] <3.4> Suppose the register-renamed version of the code from part (a) is resident in
the RS in clock cycle N , with latencies as given in Figure 3.48 . Show how the RS should
dispatch these instructions out of order, clock by clock, to obtain optimal performance
on this code. (Assume the same RS restrictions as in part (a). Also assume that res-
ults must be writen into the RS before they're available for use—no bypassing.) How
many clock cycles does the code sequence take?
c. [20] <3.4> Part (b) lets the RS try to optimally schedule these instructions. But in reality,
the whole instruction sequence of interest is not usually present in the RS. Instead,
various events clear the RS, and as a new code sequence streams in from the decoder,
the RS must choose to dispatch what it has. Suppose that the RS is empty. In cycle 0,
the first two register-renamed instructions of this sequence appear in the RS. Assume
it takes one clock cycle to dispatch any op, and assume functional unit latencies are
as they were for Exercise 3.2 . Further assume that the front end (decoder/register-re-
namer) will continue to supply two new instructions per clock cycle. Show the cycle-
by-cycle order of dispatch of the RS. How many clock cycles does this code sequence
require now?
d. [10] <3.14> If you wanted to improve the results of part (c), which would have helped
most: (1) Another ALU? (2) Another LD/ST unit? (3) Full bypassing of ALU results to
subsequent operations? or (4) Cuting the longest latency in half? What's the speedup?
e. [20] <3.7> Now let's consider speculation, the act of fetching, decoding, and executing
beyond one or more conditional branches. Our motivation to do this is twofold: The
dispatch schedule we came up with in part (c) had lots of nops, and we know com-
puters spend most of their time executing loops (which implies the branch back to the
top of the loop is prety predictable). Loops tell us where to ind more work to do;
our sparse dispatch schedule suggests we have opportunities to do some of that work
earlier than before. In part (d) you found the critical path through the loop. Imagine
folding a second copy of that path onto the schedule you got in part (b). How many
more clock cycles would be required to do two loops' worth of work (assuming all in-
structions are resident in the RS)? (Assume all functional units are fully pipelined.)
Search WWH ::




Custom Search