A possible translation to assembly language is shown in Fig. 4-40(b). We will study assembly language later in this book, and the details are not important now, but depending on the machine and the compiler, code more or less like that of Fig. 4-40(b) is likely. The first instruction compares i to 0. The second one branches to the label Else (the start of the else clause) if i is not 0. The third instruction assigns 1 to k. The fourth instruction branches to the code for the next statement. The compiler has conveniently planted a label, Next, there, so there is a place to branch to. The fifth instruction assigns 2 to k.
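Since Fig. 4-40 is not reproduced here, the following sketch shows roughly what that paragraph describes. Only CMP, BNE, BR, and the labels Else and Next come from the text; the MOV mnemonic, the operand order, and the source statement in the first comment are illustrative assumptions, not the book's actual figure.

        ; translation of something like: if (i == 0) k = 1; else k = 2;
        CMP i,0         ; compare i to 0
        BNE Else        ; if they are unequal, branch to the else clause
        MOV k,1         ; then-part: assign 1 to k
        BR Next         ; skip over the else clause
Else:   MOV k,2         ; else-part: assign 2 to k
Next:                   ; code for the next statement continues here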
The thing to observe here is that two of the five instructions are branches. Furthermore, one of these, BNE, is a conditional branch (a branch that is taken if and only if some condition is met—in this case, that the two operands in the previous CMP are unequal). The longest linear code sequence here is two instructions. As a consequence, fetching instructions at a high rate to feed the pipeline is very hard.
At first glance, it might appear that unconditional branches, such as the instruction BR Next in Fig. 4-40(b), are not a problem. After all, there is no ambiguity about where to go. Why can the fetch unit not just continue to read instructions from the target address (the place that will be branched to)?
The trouble lies in the nature of pipelining. In Fig. 4-36, for example, we see
that instruction decoding occurs in the second stage. Thus the fetch unit has to
decide where to fetch from next before it knows what kind of instruction it just got.
Only one cycle later can it learn that it just picked up an unconditional branch, and
by then it has already started to fetch the instruction following the unconditional
branch. As a consequence, a substantial number of pipelined machines (such as
the UltraSPARC III) have the property that the instruction following an unconditional branch is executed, even though logically it should not be. The position after a branch is called a delay slot. The Core i7 [and the machine used in Fig. 4-40(b)]
do not have this property, but the internal complexity to get around this problem is
often enormous. An optimizing compiler will try to find some useful instruction to
put in the delay slot, but frequently there is nothing available, so it is forced to
insert a NOP instruction there. Doing so keeps the program correct, but makes it
bigger and slower.
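To make the delay-slot problem concrete, here is a hedged sketch in the same illustrative notation as above, for a machine that does have delay slots (such as the UltraSPARC III): whatever instruction sits right after the branch gets executed. The ADD instruction and the idea of hoisting it into the slot are assumptions made for illustration; the NOP fallback is the case the text describes.

        MOV k,1         ; then-part: assign 1 to k
        BR Next         ; unconditional branch
        NOP             ; delay slot: executed no matter what, so keep it harmless

        ; If the compiler can find an instruction that is needed anyway and does
        ; not depend on the branch, it can move it into the slot instead of a NOP:
        BR Next         ; unconditional branch
        ADD j,4         ; delay slot: useful work done essentially for free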
Annoying as unconditional branches are, conditional branches are worse. Not
only do they also have delay slots, but now the fetch unit does not know where to
read from until much later in the pipeline. Early pipelined machines just stalled
until it was known whether the branch would be taken or not. Stalling for three or four cycles on every conditional branch, especially if 20% of the instructions are conditional branches, wreaks havoc with the performance: a three-cycle stall on one instruction in five adds 0.6 cycles to every instruction on average, a 60% slowdown for a pipeline that would otherwise complete one instruction per cycle.
Consequently, what most machines do when they hit a conditional branch is
predict whether it is going to be taken or not. It would be nice if we could just
plug a crystal ball into a free PCIe slot (or better yet, into the IFU) to help out with
the prediction, but so far this approach has not borne fruit.
Lacking such a nice peripheral, various ways have been devised to do the prediction. One very simple way is as follows: assume that all backward conditional branches will be taken and that all forward ones will not be taken.
 