Static Branch Prediction
All of the branch prediction techniques discussed so far are dynamic, that is, carried out at run time while the program is running. They also adapt to the program's current behavior, which is good. The downside is that they require specialized and expensive hardware and a great deal of chip complexity.
A different way to go is to have the compiler help out. When the compiler sees
a statement like
for (i = 0; i < 1000000; i++) { ... }
it knows very well that the branch at the end of the loop will be taken nearly all the
time. If only it had a way to tell the hardware, a lot of effort could be saved.
Although this is an architectural change (and not just an implementation issue),
some machines, such as the UltraSPARC III, have a second set of conditional
branch instructions, in addition to the regular ones (which are needed for backward
compatibility). The new ones contain a bit in which the compiler can specify that
it thinks the branch will be taken (or not taken). When one of these is encountered,
the fetch unit just does what it has been told. Furthermore, there is no need to
waste precious space in the branch history table for these instructions, thus reducing conflicts there.
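As a rough illustration of how such a hint can be expressed at the source level (this example is ours, not taken from the text), GCC and Clang provide the __builtin_expect intrinsic, which lets the programmer state which way a branch is expected to go. Whether the hint ends up as a bit in a conditional branch instruction or merely influences code layout depends on the target architecture; the sketch below simply shows the compiler being told which outcome to expect.

#include <stdio.h>

/* Convenience macros in the style popularized by the Linux kernel
   (an assumption for illustration, not something from the text). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

int sum_nonnegative(const int *a, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        /* The negative case is expected to be rare. */
        if (unlikely(a[i] < 0))
            continue;
        sum += a[i];
    }
    return sum;
}

int main(void)
{
    int data[] = { 1, 2, 3, -4, 5 };
    printf("%d\n", sum_nonnegative(data, 5));
    return 0;
}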
Finally, our last branch prediction technique is based on profiling (Fisher and
Freudenberger, 1992). This, too, is a static technique, but instead of having the
compiler try to figure out which branches will be taken and which will not, the program is actually run (typically on a simulator) and the branch behavior captured.
This information is fed into the compiler, which then uses the special conditional
branch instructions to tell the hardware what to do.
4.5.3 Out-of-Order Execution and Register Renaming
Most modern CPUs are both pipelined and superscalar, as shown in Fig. 2-6.
What this generally means is that a fetch unit pulls instruction words out of memory before they are needed in order to feed a decode unit. The decode unit issues
the decoded instructions to the proper functional units for execution. In some
cases it may break individual instructions into micro-ops before issuing them, depending on what the functional units can do.
Clearly, the machine design is simplest if all instructions are executed in the
order they are fetched (assuming for the moment that the branch prediction algorithm never guesses wrong). However, in-order execution does not always give
optimal performance due to dependences between instructions. If an instruction
needs a value computed by the previous instruction, the second one cannot begin
executing until the first one has produced the needed value. In this situation (a RAW, or Read After Write, dependence), the second instruction has to wait. Other kinds of dependences
also exist, as we will soon see.
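To make the RAW case concrete, here is a small sketch in C (the names are invented for illustration): the second statement reads the value the first one writes, so an in-order pipeline must wait for the addition to complete before the multiplication can begin.

/* A read-after-write (RAW) dependence between two adjacent statements. */
int raw_example(int b, int c)
{
    int t = b + c;   /* first instruction: writes t                  */
    int u = t * 4;   /* second instruction: reads t, so it must wait */
    return u;
}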