A possible translation to assembly language is shown in Fig. 4-40(b). We will study assembly language later in this book, and the details are not important now, but depending on the machine and the compiler, code more or less like that of Fig. 4-40(b) is likely. The first instruction compares i to 0. The second one branches to the label Else (the start of the else clause) if i is not 0. The third instruction assigns 1 to k. The fourth instruction branches to the code for the next statement. The compiler has conveniently planted a label, Next, there, so there is a place to branch to. The fifth instruction assigns 2 to k.
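Since Fig. 4-40 is not reproduced here, the following sketch shows roughly what that paragraph describes. Only CMP, BNE, BR, and the labels Else and Next come from the text; the MOV mnemonic, the operand order, and the source statement in the first comment are illustrative assumptions, not the book's actual figure.

        ; translation of something like: if (i == 0) k = 1; else k = 2;
        CMP i,0         ; compare i to 0
        BNE Else        ; if they are unequal, branch to the else clause
        MOV k,1         ; then-part: assign 1 to k
        BR Next         ; skip over the else clause
Else:   MOV k,2         ; else-part: assign 2 to k
Next:                   ; code for the next statement continues here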
The thing to observe here is that two of the five instructions are branches. Furthermore, one of these, BNE, is a conditional branch (a branch that is taken if and only if some condition is met—in this case, that the two operands in the previous CMP are unequal). The longest linear code sequence here is two instructions. As a consequence, fetching instructions at a high rate to feed the pipeline is very hard.
At first glance, it might appear that unconditional branches, such as the instruction BR Next in Fig. 4-40(b), are not a problem. After all, there is no ambiguity about where to go. Why can the fetch unit not just continue to read instructions from the target address (the place that will be branched to)?
The trouble lies in the nature of pipelining. In Fig. 4-36, for example, we see
that instruction decoding occurs in the second stage. Thus the fetch unit has to
decide where to fetch from next before it knows what kind of instruction it just got.
Only one cycle later can it learn that it just picked up an unconditional branch, and
by then it has already started to fetch the instruction following the unconditional
branch. As a consequence, a substantial number of pipelined machines (such as
the UltraSPARC III) have the property that the instruction following an unconditional branch is executed, even though logically it should not be. The position after a branch is called a delay slot. The Core i7 [and the machine used in Fig. 4-40(b)]
do not have this property, but the internal complexity to get around this problem is
often enormous. An optimizing compiler will try to find some useful instruction to
put in the delay slot, but frequently there is nothing available, so it is forced to
insert a NOP instruction there. Doing so keeps the program correct, but makes it
bigger and slower.
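To make the delay-slot problem concrete, here is a hedged sketch in the same illustrative notation as above, for a machine that does have delay slots (such as the UltraSPARC III): whatever instruction sits right after the branch gets executed. The ADD instruction and the idea of hoisting it into the slot are assumptions made for illustration; the NOP fallback is the case the text describes.

        MOV k,1         ; then-part: assign 1 to k
        BR Next         ; unconditional branch
        NOP             ; delay slot: executed no matter what, so keep it harmless

        ; If the compiler can find an instruction that is needed anyway and does
        ; not depend on the branch, it can move it into the slot instead of a NOP:
        BR Next         ; unconditional branch
        ADD j,4         ; delay slot: useful work done essentially for free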
Annoying as unconditional branches are, conditional branches are worse. Not
only do they also have delay slots, but now the fetch unit does not know where to
read from until much later in the pipeline. Early pipelined machines just stalled
until it was known whether the branch would be taken or not. Stalling for three or four cycles on every conditional branch, especially if 20% of the instructions are conditional branches, wreaks havoc with the performance: a three-cycle stall on one instruction in five adds 0.6 cycles to every instruction on average, a 60% slowdown for a pipeline that would otherwise complete one instruction per cycle.
Consequently, what most machines do when they hit a conditional branch is
predict whether it is going to be taken or not. It would be nice if we could just
plug a crystal ball into a free PCIe slot (or better yet, into the IFU) to help out with
the prediction, but so far this approach has not borne fruit.
Lacking such a nice peripheral, various ways have been devised to do the prediction. One very simple way is as follows: assume that all backward conditional branches will be taken and that all forward ones will not be taken.
 