A simple way to detect such memory-ordering conflicts is to perform the effective address calculations in program order. (We really only need to keep the relative order between stores and other memory references; that is, loads can be reordered freely.)
Let's consider the situation of a load first. If we perform effective address calculation in
program order, then when a load has completed effective address calculation, we can check
whether there is an address conflict by examining the A field of all active store buffers. If the
load address matches the address of any active entries in the store buffer, that load instruc-
tion is not sent to the load buffer until the conflicting store completes. (Some implementations
bypass the value directly to the load from a pending store, reducing the delay for this RAW
hazard.)
Stores operate similarly, except that the processor must check for conflicts in both the load
buffers and the store buffers, since conflicting stores cannot be reordered with respect to either
a load or a store.
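The checks just described can be written out as a short sketch. This is only an illustration in Python of the comparison logic, under assumed buffer layouts and function names; it is not a description of any particular machine's hardware.

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class StoreEntry:
        addr: int                 # the A field: effective address of the pending store
        value: Optional[int]      # the value to be written, once it is available

    @dataclass
    class LoadEntry:
        addr: int                 # effective address of the pending load

    def check_load(addr: int, store_buffers: List[StoreEntry],
                   bypass: bool = True) -> Tuple[str, Optional[int]]:
        # A load whose effective address is known checks every active store entry.
        for st in reversed(store_buffers):        # youngest conflicting store wins
            if st.addr == addr:
                if bypass and st.value is not None:
                    return ("forward", st.value)  # bypass the value from the pending store
                return ("stall", None)            # wait until the conflicting store completes
        return ("to_load_buffer", None)           # no conflict: the load may proceed

    def check_store(addr: int, load_buffers: List[LoadEntry],
                    store_buffers: List[StoreEntry]) -> str:
        # A store checks both buffers, since it cannot be reordered with respect
        # to either a conflicting load or a conflicting store.
        conflict = any(ld.addr == addr for ld in load_buffers) or \
                   any(st.addr == addr for st in store_buffers)
        return "stall" if conflict else "proceed"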
A dynamically scheduled pipeline can yield very high performance, provided branches are
predicted accurately—an issue we addressed in the last section. The major drawback of this
approach is the complexity of the Tomasulo scheme, which requires a large amount of hard-
ware. In particular, each reservation station must contain an associative buffer, which must
run at high speed, as well as complex control logic. The performance can also be limited by
the single CDB. Although additional CDBs can be added, each CDB must interact with each
reservation station, and the associative tag-matching hardware would have to be duplicated
at each station for each CDB.
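The associative tag match that each reservation station performs on a CDB broadcast can be sketched as follows. The Qj/Qk and Vj/Vk fields follow the usual reservation-station notation; the rest of the code is an assumption made purely for illustration.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class ReservationStation:
        qj: Optional[int] = None   # tag of the station that will produce operand j
        qk: Optional[int] = None   # tag of the station that will produce operand k
        vj: Optional[int] = None   # operand j value, once captured
        vk: Optional[int] = None   # operand k value, once captured

    def cdb_broadcast(tag: int, value: int,
                      stations: List[ReservationStation]) -> None:
        # Every waiting station compares the broadcast tag against its Qj and Qk
        # fields; a match captures the value. Adding a second CDB means duplicating
        # this comparison logic at every station, which is why extra CDBs are costly.
        for rs in stations:
            if rs.qj == tag:
                rs.vj, rs.qj = value, None
            if rs.qk == tag:
                rs.vk, rs.qk = value, None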
In Tomasulo's scheme, two different techniques are combined: the renaming of the architec-
tural registers to a larger set of registers and the buffering of source operands from the register
file. Source operand buffering resolves WAR hazards that arise when the operand is available
in the registers. As we will see later, it is also possible to eliminate WAR hazards by the renam-
ing of a register together with the buffering of a result until no outstanding references to the
earlier version of the register remain. This approach will be used when we discuss hardware
speculation.
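As a simplified illustration of the renaming idea, the sketch below gives every architectural destination a fresh physical register, so a later write can no longer conflict with an earlier, still-pending read. The instruction format and the unbounded supply of physical registers are assumptions made only for this example.

    def rename(instructions):
        # instructions: list of (dest, src1, src2) architectural register names.
        rename_map = {}            # architectural register -> current physical register
        next_phys = 0
        renamed = []
        for dest, src1, src2 in instructions:
            # Sources read whichever physical register currently holds their value.
            p1 = rename_map.get(src1, src1)
            p2 = rename_map.get(src2, src2)
            # Each destination write is given a brand-new physical register.
            pd = "P%d" % next_phys
            next_phys += 1
            rename_map[dest] = pd
            renamed.append((pd, p1, p2))
        return renamed

    # Example: the WAR hazard on F8 between the first instruction (which reads F8)
    # and the second (which writes F8) disappears after renaming:
    #   rename([("F6", "F0", "F8"), ("F8", "F10", "F14"), ("F6", "F10", "F8")])
    #   returns [("P0", "F0", "F8"), ("P1", "F10", "F14"), ("P2", "F10", "P1")]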
Tomasulo's scheme was unused for many years after the 360/91, but was widely adopted in
multiple-issue processors starting in the 1990s for several reasons:
1. Although Tomasulo's algorithm was designed before caches, the presence of caches, with
their inherently unpredictable delays, has become one of the major motivations for dynamic
scheduling. Out-of-order execution allows the processor to continue executing instruc-
tions while awaiting the completion of a cache miss, thus hiding all or part of the cache
miss penalty.
2. As processors became more aggressive in their issue capability and designers grew con-
cerned with the performance of difficult-to-schedule code (such as most nonnumeric code),
techniques such as register renaming, dynamic scheduling, and speculation became more
important.
3. It can achieve high performance without requiring the compiler to target code to a specific
pipeline structure, a valuable property in the era of shrink-wrapped mass market software.
3.6 Hardware-Based Speculation
As we try to exploit more instruction-level parallelism, maintaining control dependences
becomes an increasing burden. Branch prediction reduces the direct stalls attributable to
branches, but for a processor executing multiple instructions per clock, just predicting
branches accurately may not be sufficient to generate the desired amount of instruction-level
parallelism. A wide-issue processor may need to execute a branch every clock cycle to maintain maximum performance.