Pipelining: Basic and Intermediate Concepts - Computer Architecture: A Quantitative Approach

ate between 32-bit and 64-bit versions. Immediate versions of these instructions use the

same mnemonics with a suffix of I. In MIPS, there are both signed and unsigned forms

of the arithmetic instructions; the unsigned forms, which do not generate overflow exceptions

(and thus are the same in 32-bit and 64-bit mode), have a U at the end (e.g., DADDU,

DSUBU, DADDIU).

2. Load and store instructions: These instructions take a register source, called the base register,

and an immediate field (16-bit in MIPS), called the offset, as operands. The sum of the contents

of the base register and the sign-extended offset, called the effective address, is used

as a memory address. In the case of a load instruction, a second register operand acts as

the destination for the data loaded from memory. In the case of a store, the second register

operand is the source of the data that is stored into memory. The instructions load doubleword

(LD) and store doubleword (SD) load or store the entire 64-bit register contents.

3. Branches and jumps: Branches are conditional transfers of control. There are usually two

ways of specifying the branch condition in RISC architectures: with a set of condition bits

(sometimes called a condition code) or by a limited set of comparisons between a pair of registers

or between a register and zero. MIPS uses the latter. For this appendix, we consider

only comparisons for equality between two registers. In all RISC architectures, the branch

destination is obtained by adding a sign-extended offset (16 bits in MIPS) to the current

PC. Unconditional jumps are provided in many RISC architectures, but we will not cover

jumps in this appendix.
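
The address arithmetic described for loads, stores, and branches can be sketched in a few lines. This is a minimal illustration, assuming 64-bit registers and a 16-bit immediate field as stated above; the function names (sign_extend16, effective_address, branch_target) are hypothetical, and the sketch follows the simplified description in the text rather than every detail of the real MIPS encoding.

```python
MASK64 = (1 << 64) - 1  # assumption: registers and addresses are 64 bits wide


def sign_extend16(imm16: int) -> int:
    """Interpret the low 16 bits of imm16 as a two's-complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16


def effective_address(base: int, imm16: int) -> int:
    """Base register contents plus the sign-extended offset, wrapped to 64 bits."""
    return (base + sign_extend16(imm16)) & MASK64


def branch_target(pc: int, imm16: int) -> int:
    """PC plus the sign-extended offset, following the simplified text."""
    return (pc + sign_extend16(imm16)) & MASK64


if __name__ == "__main__":
    # An offset field of 0xFFF8 encodes -8, so the access lands 8 bytes below the base.
    print(hex(effective_address(0x1000, 0xFFF8)))
    print(hex(branch_target(0x400, 0x0004)))
```

Because the offset is sign-extended, the same 16-bit field can reach addresses both above and below the base register or the PC.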

A Simple Implementation of a RISC Instruction Set

To understand how a RISC instruction set can be implemented in a pipelined fashion, we

need to understand how it is implemented without pipelining. This section shows a simple

implementation where every instruction takes at most 5 clock cycles. We will extend this basic

implementation to a pipelined version, resulting in a much lower CPI. Our unpipelined

implementation is not the most economical or the highest-performance implementation without

pipelining. Instead, it is designed to lead naturally to a pipelined implementation. Implementing

the instruction set requires the introduction of several temporary registers that are not part

of the architecture; these are introduced in this section to simplify pipelining. Our implementation

will focus only on a pipeline for an integer subset of a RISC architecture that consists of

load-store word, branch, and integer ALU operations.
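
To make the CPI claim concrete, the following rough sketch compares cycle counts under two simplifying assumptions that are not part of the text: every unpipelined instruction uses the full 5 cycles (the stated upper bound), and the pipelined version never stalls. The function names are hypothetical.

```python
def unpipelined_cycles(n_instructions: int, cycles_per_instruction: int = 5) -> int:
    """Each instruction runs to completion before the next one starts."""
    return n_instructions * cycles_per_instruction


def ideal_pipelined_cycles(n_instructions: int, stages: int = 5) -> int:
    """The first instruction needs `stages` cycles; with no stalls, each
    later instruction completes one cycle after its predecessor."""
    if n_instructions == 0:
        return 0
    return stages + (n_instructions - 1)


def cpi(total_cycles: int, n_instructions: int) -> float:
    """Average clock cycles per instruction."""
    return total_cycles / n_instructions


if __name__ == "__main__":
    n = 1000
    print(cpi(unpipelined_cycles(n), n))      # 5.0
    print(cpi(ideal_pipelined_cycles(n), n))  # 1.004
```

Under these assumptions the pipelined CPI approaches 1 as the instruction count grows; real pipelines fall short of that ideal whenever stalls occur.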

Every instruction in this RISC subset can be implemented in at most 5 clock cycles. The 5

clock cycles are as follows.

1. Instruction fetch cycle (IF):

Send the program counter (PC) to memory and fetch the current instruction from memory.

Update the PC to the next sequential PC by adding 4 (since each instruction is 4 bytes) to

the PC.

2. Instruction decode/register fetch cycle (ID):

Decode the instruction and read the registers corresponding to register source specifiers

from the register file. Do the equality test on the registers as they are read, for a possible

branch. Sign-extend the offset field of the instruction in case it is needed. Compute the possible

branch target address by adding the sign-extended offset to the incremented PC. In

an aggressive implementation, which we explore later, the branch can be completed at the

end of this stage by storing the branch-target address into the PC, if the condition test yielded

true.

Decoding is done in parallel with reading registers, which is possible because the register
