Static Branch Prediction
All of the branch prediction techniques discussed so far are dynamic, that is, carried out at run time while the program is running. They also adapt to the program's current behavior, which is good. The downside is that they require specialized and expensive hardware and a great deal of chip complexity.
A different way to go is to have the compiler help out. When the compiler sees
a statement like
for (i = 0; i < 1000000; i++) { ... }
it knows very well that the branch at the end of the loop will be taken nearly all the
time. If only it had a way to tell the hardware, a lot of effort could be saved.
Although this is an architectural change (and not just an implementation issue),
some machines, such as the UltraSPARC III, have a second set of conditional
branch instructions, in addition to the regular ones (which are needed for backward
compatibility). The new ones contain a bit in which the compiler can specify that
it thinks the branch will be taken (or not taken). When one of these is encountered,
the fetch unit just does what it has been told. Furthermore, there is no need to
waste precious space in the branch history table for these instructions, thus reducing conflicts there.
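As a rough illustration of how such a hint can be expressed at the source level (this example is ours, not taken from the text), GCC and Clang provide the __builtin_expect intrinsic, which lets the programmer state which way a branch is expected to go. Whether the hint ends up as a bit in a conditional branch instruction or merely influences code layout depends on the target architecture; the sketch below simply shows the compiler being told which outcome to expect.

#include <stdio.h>

/* Convenience macros in the style popularized by the Linux kernel
   (an assumption for illustration, not something from the text). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int sum_nonnegative(const int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        /* The negative case is expected to be rare. */
        if (unlikely(a[i] < 0))
            continue;
        sum += a[i];
    }
    return sum;
}

int main(void)
{
    int data[] = { 1, 2, 3, -4, 5 };
    printf("%d\n", sum_nonnegative(data, 5));
    return 0;
}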
Finally, our last branch prediction technique is based on profiling (Fisher and
Freudenberger, 1992). This, too, is a static technique, but instead of having the
compiler try to figure out which branches will be taken and which will not, the program is actually run (typically on a simulator) and the branch behavior captured.
This information is fed into the compiler, which then uses the special conditional
branch instructions to tell the hardware what to do.
4.5.3 Out-of-Order Execution and Register Renaming
Most modern CPUs are both pipelined and superscalar, as shown in Fig. 2-6.
What this generally means is that a fetch unit pulls instruction words out of memory before they are needed in order to feed a decode unit. The decode unit issues
the decoded instructions to the proper functional units for execution. In some
cases it may break individual instructions into micro-ops before issuing them, depending on what the functional units can do.
Clearly, the machine design is simplest if all instructions are executed in the
order they are fetched (assuming for the moment that the branch prediction algorithm never guesses wrong). However, in-order execution does not always give
optimal performance due to dependences between instructions. If an instruction
needs a value computed by the previous instruction, the second one cannot begin
executing until the first one has produced the needed value. In this situation (a RAW, or Read After Write, dependence), the second instruction has to wait. Other kinds of dependences
also exist, as we will soon see.
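To make the RAW case concrete, here is a small sketch in C (the names are invented for illustration): the second statement reads the value the first one writes, so an in-order pipeline must wait for the addition to complete before the multiplication can begin.

/* A read-after-write (RAW) dependence between two adjacent statements. */
int raw_example(int b, int c)
{
    int t = b + c;   /* first instruction: writes t                  */
    int u = t * 4;   /* second instruction: reads t, so it must wait */
    return u;
}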