Although many figures will omit such registers for simplicity, they are required to make the
pipeline operate properly and must be present. Of course, similar registers would be needed
even in a multicycle data path that had no pipelining (since only values in registers are pre-
served across clock boundaries). In the case of a pipelined processor, the pipeline registers also
play the key role of carrying intermediate results from one stage to another where the source
and destination may not be directly adjacent. For example, the register value to be stored dur-
ing a store instruction is read during ID, but not actually used until MEM; it is passed through
two pipeline registers to reach the data memory during the MEM stage. Likewise, the res-
ult of an ALU instruction is computed during EX, but not actually stored until WB; it arrives
there by passing through two pipeline registers. It is sometimes useful to name the pipeline
registers, and we follow the convention of naming them by the pipeline stages they connect,
so that the registers are called IF/ID, ID/EX, EX/MEM, and MEM/WB.
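To make the role of these registers concrete, the following minimal Python sketch (not from the text) models each pipeline register as a dictionary whose contents are latched forward once per clock edge; field names such as store_data are hypothetical and serve only to show a store's register value riding through ID/EX and EX/MEM until the MEM stage needs it.

# Purely illustrative model: each pipeline register is a dictionary that is
# copied forward on every clock edge. Stage logic is omitted, and field names
# such as "store_data" are hypothetical.
def tick(if_id, id_ex, ex_mem, mem_wb):
    """Advance one cycle: every register latches what the stage before it produced."""
    return {}, dict(if_id), dict(id_ex), dict(ex_mem)

# A store's register value is read during ID and written into ID/EX ...
if_id, id_ex, ex_mem, mem_wb = {}, {"opcode": "sw", "store_data": 0x1234}, {}, {}

# ... one clock edge later it has been carried into EX/MEM, where the data
# memory can finally use it during the MEM stage.
if_id, id_ex, ex_mem, mem_wb = tick(if_id, id_ex, ex_mem, mem_wb)
print(ex_mem)   # {'opcode': 'sw', 'store_data': 4660}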
Basic Performance Issues in Pipelining
Pipelining increases the CPU instruction throughput—the number of instructions completed
per unit of time—but it does not reduce the execution time of an individual instruction. In fact,
it usually slightly increases the execution time of each instruction due to overhead in the con-
trol of the pipeline. The increase in instruction throughput means that a program runs faster
and has lower total execution time, even though no single instruction runs faster!
The fact that the execution time of each instruction does not decrease puts limits on the prac-
tical depth of a pipeline, as we will see in the next section. In addition to limitations arising
from pipeline latency, limits arise from imbalance among the pipe stages and from pipelining
overhead. Imbalance among the pipe stages reduces performance since the clock can run no
faster than the time needed for the slowest pipeline stage. Pipeline overhead arises from the
combination of pipeline register delay and clock skew. The pipeline registers add setup time,
which is the time that a register input must be stable before the clock signal that triggers a
write occurs, plus propagation delay to the clock cycle. Clock skew, which is the maximum delay
between when the clock arrives at any two registers, also contributes to the lower limit on the
clock cycle. Once the clock cycle is as small as the sum of the clock skew and latch overhead,
no further pipelining is useful, since there is no time left in the cycle for useful work. The
interested reader should see Kunkel and Smith [1986]. As we saw in Chapter 3, this overhead
affected the performance gains achieved by the Pentium 4 versus the Pentium III.
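As a rough back-of-the-envelope illustration (the numbers below are invented, not taken from the text), dividing a fixed amount of logic into more stages shrinks the logic delay per stage while the register overhead and clock skew stay constant, so the cycle time approaches a floor:

# Hedged illustration with invented numbers: 4.0 ns of total logic split into
# perfectly balanced stages, plus a fixed 0.1 ns of latch overhead and clock
# skew per stage. Deeper pipelines shrink the logic per stage, but the
# overhead term remains, so the clock cycle approaches a floor.
TOTAL_LOGIC_NS = 4.0
OVERHEAD_NS = 0.1  # register setup/propagation plus clock skew (assumed)

for stages in (1, 2, 4, 8, 16, 32):
    cycle = TOTAL_LOGIC_NS / stages + OVERHEAD_NS
    print(f"{stages:2d} stages: cycle = {cycle:5.3f} ns, "
          f"throughput = {1.0 / cycle:5.2f} instructions/ns")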
Example
Consider the unpipelined processor in the previous section. Assume that it has
a 1 ns clock cycle and that it uses 4 cycles for ALU operations and branches and
5 cycles for memory operations. Assume that the relative frequencies of these
operations are 40%, 20%, and 40%, respectively. Suppose that due to clock skew
and setup, pipelining the processor adds 0.2 ns of overhead to the clock. Ignor-
ing any latency impact, how much speedup in the instruction execution rate
will we gain from a pipeline?
Answer
The average instruction execution time on the unpipelined processor is

Average instruction execution time = Clock cycle × Average CPI
= 1 ns × ((40% + 20%) × 4 + 40% × 5)
= 1 ns × 4.4
= 4.4 ns

In the pipelined implementation, the clock must run at the speed of the slowest stage plus overhead, which will be 1 + 0.2 or 1.2 ns; this is the average instruction execution time. Thus, the speedup from pipelining is

Speedup from pipelining = Average instruction time unpipelined / Average instruction time pipelined
= 4.4 ns / 1.2 ns ≈ 3.7 times

The 0.2 ns overhead essentially establishes a limit on the effectiveness of pipelining.
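As a quick check of the arithmetic, a few lines of Python reproduce the same figures (the variable names are ours, not from the text):

# Sanity check of the example's numbers.
cycle_ns = 1.0
unpipelined_avg = cycle_ns * ((0.40 + 0.20) * 4 + 0.40 * 5)  # 4.4 ns average
pipelined_avg = cycle_ns + 0.2                               # slowest stage + 0.2 ns overhead
print(f"speedup = {unpipelined_avg / pipelined_avg:.1f}x")   # prints speedup = 3.7x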