In an attempt to get around these problems and produce better performance,
some CPUs allow dependent instructions to be skipped over, to get to future
instructions that are not dependent. Needless to say, the internal
instruction-scheduling algorithm used must deliver the same effect as if the
program were executed in the order written. We will now demonstrate how
instruction reordering works using a detailed example.
To illustrate the nature of the problem, we will start with a machine that always
issues instructions in program order and also requires them to complete execution
in program order. The significance of the latter will become clear later.
Our example machine has eight registers visible to the programmer, R0
through R7. All arithmetic instructions use three registers: two for the operands
and one for the result, the same as the Mic-4. We will assume that if an instruction
is decoded in cycle n, execution starts in cycle n + 1. For a simple instruction,
such as an addition or subtraction, the writeback to the destination register occurs
at the end of cycle n + 2. For a more complicated instruction, such as a
multiplication, the writeback occurs at the end of cycle n + 3. To make the example
realistic, we will allow the decode unit to issue up to two instructions per clock
cycle. Commercial superscalar CPUs often can issue four or even six instructions
per clock cycle.
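
The timing assumptions above can be written down as a small sketch. The function names and the operation mnemonics (`add`, `sub`, `mul`) are hypothetical, chosen only for illustration; the offsets come from the text, and the sketch assumes the instruction is issued in the same cycle in which it is decoded.

```python
# Writeback offsets relative to the decode cycle n, as stated in the text:
# simple instructions write back at the end of cycle n + 2,
# multiplications at the end of cycle n + 3.
# The mnemonics are illustrative assumptions, not names from the text.
WRITEBACK_OFFSET = {"add": 2, "sub": 2, "mul": 3}


def execution_start(decode_cycle: int) -> int:
    """Execution starts one cycle after decoding (cycle n + 1)."""
    return decode_cycle + 1


def writeback_cycle(op: str, decode_cycle: int) -> int:
    """Cycle at whose end the result is written to the destination register."""
    return decode_cycle + WRITEBACK_OFFSET[op]
```

For an addition decoded in cycle 1, this gives execution starting in cycle 2 and writeback at the end of cycle 3; a multiplication decoded in the same cycle writes back at the end of cycle 4.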
Our example execution sequence is shown in Fig. 4-43. Here the first column
gives the number of the cycle and the second one gives the instruction number.
The third column lists the instruction decoded. The fourth one tells which
instruction is being issued (with a maximum of two per clock cycle). The fifth one
tells which instruction has been retired (completed). Remember that in this example
we are requiring both in-order issue and in-order completion, so instruction k + 1
cannot be issued until instruction k has been issued, and instruction k + 1 cannot be
retired (meaning the writeback to the destination register is performed) until
instruction k has been retired. The other 16 columns are discussed below.
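
The in-order completion rule can be illustrated with a minimal sketch. It takes, for each instruction in program order, the cycle in which its result would be ready, and delays retirement so that no instruction retires before its predecessor. The function name is hypothetical, and the sketch assumes that two consecutive instructions may retire in the same cycle, which the text above does not state explicitly.

```python
def in_order_retire(ready_cycles):
    """Return retirement cycles that respect program order.

    ready_cycles[k] is the cycle in which instruction k could write back
    if it were unconstrained. Instruction k + 1 is held until instruction k
    has retired (assumption: retiring in the same cycle is allowed).
    """
    retired = []
    previous = 0  # cycle in which the preceding instruction retired
    for ready in ready_cycles:
        previous = max(ready, previous)
        retired.append(previous)
    return retired
```

With this rule, a short instruction that follows a multiplication is held back: ready cycles of 4 and 3 yield retirement in cycles 4 and 4.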
After decoding an instruction, the decode unit has to decide whether or not it
can be issued immediately. To make this decision, the decode unit needs to know
the status of all the registers. If, for example, the current instruction needs a
register whose value has not yet been computed, the current instruction cannot be
issued and the CPU must stall.
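
A minimal sketch of this issue decision follows. It covers only the one condition named above (an operand whose value has not yet been computed); the function name and the representation of pending results as a set of register numbers are assumptions made for illustration.

```python
def can_issue(source_registers, pending_writes):
    """Return True if every source register already holds its value.

    source_registers: register numbers the instruction reads.
    pending_writes:   register numbers whose values are still being
                      computed by earlier instructions (hypothetical
                      representation).
    """
    return not any(reg in pending_writes for reg in source_registers)
```

If the function returns False, the instruction is not issued in that cycle and the decode unit stalls.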
We will keep track of register use with a device called a scoreboard, which
was first present in the CDC 6600. The scoreboard has a small counter for each
register telling how many times that register is in use as a source by currently
executing instructions. If a maximum of, say, 15 instructions may be executing at
once, then a 4-bit counter will do. When an instruction is issued, the scoreboard
entries for its operand registers are incremented. When an instruction is retired,
the entries are decremented.
The scoreboard also has counters to keep track of registers being used as
destinations. Since only one write at a time is allowed, these counters can be 1-bit
wide. The rightmost 16 columns in Fig. 4-43 show the scoreboard.
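
The bookkeeping described above can be sketched as a small class. This is an illustrative model under stated assumptions, not the hardware design: the class and method names are hypothetical, the eight registers match the example machine, and the 1-bit destination counters are modeled as booleans.

```python
class Scoreboard:
    """Per-register usage tracking for the eight-register example machine."""

    def __init__(self, num_registers=8):
        # How many executing instructions use each register as a source.
        self.reading = [0] * num_registers
        # 1-bit destination counters: True while a write is outstanding.
        self.writing = [False] * num_registers

    def issue(self, dest, src1, src2):
        """Record an instruction dest = src1 op src2 being issued."""
        if self.writing[dest]:
            # Only one write at a time is allowed per register.
            raise ValueError(f"register R{dest} already has a pending write")
        self.reading[src1] += 1
        self.reading[src2] += 1
        self.writing[dest] = True

    def retire(self, dest, src1, src2):
        """Undo the bookkeeping when the instruction retires."""
        self.reading[src1] -= 1
        self.reading[src2] -= 1
        self.writing[dest] = False
```

For example, issuing R3 = R0 * R1 followed by R4 = R0 + R2 leaves the source counter for R0 at 2 and marks R3 and R4 as pending destinations; retiring both instructions returns every entry to its initial state.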