A simple way to detect such memory-ordering conflicts is to perform the effective address calculations in program order. (We really only need to keep the relative order between stores and other memory references; that is, loads can be reordered freely.)
Let's consider the situation of a load first. If we perform effective address calculation in
program order, then when a load has completed effective address calculation, we can check
whether there is an address conflict by examining the A field of all active store buffers. If the
load address matches the address of any active entries in the store buffer, that load instruc-
tion is not sent to the load buffer until the conflicting store completes. (Some implementations
bypass the value directly to the load from a pending store, reducing the delay for this RAW
hazard.)
Stores operate similarly, except that the processor must check for conflicts in both the load
buffers and the store buffers, since conflicting stores cannot be reordered with respect to either
a load or a store.
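The checks just described can be written out as a short sketch. This is only an illustration in Python of the comparison logic, under assumed buffer layouts and function names; it is not a description of any particular machine's hardware.

    from dataclasses import dataclass
    from typing import List, Optional, Tuple

    @dataclass
    class StoreEntry:
        addr: int                 # the A field: effective address of the pending store
        value: Optional[int]      # the value to be written, once it is available

    @dataclass
    class LoadEntry:
        addr: int                 # effective address of the pending load

    def check_load(addr: int, store_buffers: List[StoreEntry],
                   bypass: bool = True) -> Tuple[str, Optional[int]]:
        # A load whose effective address is known checks every active store entry.
        for st in reversed(store_buffers):        # youngest conflicting store wins
            if st.addr == addr:
                if bypass and st.value is not None:
                    return ("forward", st.value)  # bypass the value from the pending store
                return ("stall", None)            # wait until the conflicting store completes
        return ("to_load_buffer", None)           # no conflict: the load may proceed

    def check_store(addr: int, load_buffers: List[LoadEntry],
                    store_buffers: List[StoreEntry]) -> str:
        # A store checks both buffers, since it cannot be reordered with respect
        # to either a conflicting load or a conflicting store.
        conflict = any(ld.addr == addr for ld in load_buffers) or \
                   any(st.addr == addr for st in store_buffers)
        return "stall" if conflict else "proceed"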
A dynamically scheduled pipeline can yield very high performance, provided branches are
predicted accurately—an issue we addressed in the last section. The major drawback of this
approach is the complexity of the Tomasulo scheme, which requires a large amount of hard-
ware. In particular, each reservation station must contain an associative buffer, which must
run at high speed, as well as complex control logic. The performance can also be limited by
the single CDB. Although additional CDBs can be added, each CDB must interact with each
reservation station, and the associative tag-matching hardware would have to be duplicated
at each station for each CDB.
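The associative tag match that each reservation station performs on a CDB broadcast can be sketched as follows. The Qj/Qk and Vj/Vk fields follow the usual reservation-station notation; the rest of the code is an assumption made purely for illustration.

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class ReservationStation:
        qj: Optional[int] = None   # tag of the station that will produce operand j
        qk: Optional[int] = None   # tag of the station that will produce operand k
        vj: Optional[int] = None   # operand j value, once captured
        vk: Optional[int] = None   # operand k value, once captured

    def cdb_broadcast(tag: int, value: int,
                      stations: List[ReservationStation]) -> None:
        # Every waiting station compares the broadcast tag against its Qj and Qk
        # fields; a match captures the value. Adding a second CDB means duplicating
        # this comparison logic at every station, which is why extra CDBs are costly.
        for rs in stations:
            if rs.qj == tag:
                rs.vj, rs.qj = value, None
            if rs.qk == tag:
                rs.vk, rs.qk = value, None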
In Tomasulo's scheme, two different techniques are combined: the renaming of the architec-
tural registers to a larger set of registers and the buffering of source operands from the register
file. Source operand buffering resolves WAR hazards that arise when the operand is available
in the registers. As we will see later, it is also possible to eliminate WAR hazards by the renam-
ing of a register together with the buffering of a result until no outstanding references to the
earlier version of the register remain. This approach will be used when we discuss hardware
speculation.
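As a simplified illustration of the renaming idea, the sketch below gives every architectural destination a fresh physical register, so a later write can no longer conflict with an earlier, still-pending read. The instruction format and the unbounded supply of physical registers are assumptions made only for this example.

    def rename(instructions):
        # instructions: list of (dest, src1, src2) architectural register names.
        rename_map = {}            # architectural register -> current physical register
        next_phys = 0
        renamed = []
        for dest, src1, src2 in instructions:
            # Sources read whichever physical register currently holds their value.
            p1 = rename_map.get(src1, src1)
            p2 = rename_map.get(src2, src2)
            # Each destination write is given a brand-new physical register.
            pd = "P%d" % next_phys
            next_phys += 1
            rename_map[dest] = pd
            renamed.append((pd, p1, p2))
        return renamed

    # Example: the WAR hazard on F8 between the first instruction (which reads F8)
    # and the second (which writes F8) disappears after renaming:
    #   rename([("F6", "F0", "F8"), ("F8", "F10", "F14"), ("F6", "F10", "F8")])
    #   returns [("P0", "F0", "F8"), ("P1", "F10", "F14"), ("P2", "F10", "P1")]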
Tomasulo's scheme was unused for many years after the 360/91, but was widely adopted in
multiple-issue processors starting in the 1990s for several reasons:
1. Although Tomasulo's algorithm was designed before caches, the presence of caches, with
their inherently unpredictable delays, has become one of the major motivations for dynamic
scheduling. Out-of-order execution allows the processor to continue executing instruc-
tions while awaiting the completion of a cache miss, thus hiding all or part of the cache
miss penalty.
2. As processors became more aggressive in their issue capability and designers grew con-
cerned with the performance of difficult-to-schedule code (such as most nonnumeric code),
techniques such as register renaming, dynamic scheduling, and speculation became more
important.
3. It can achieve high performance without requiring the compiler to target code to a specific
pipeline structure, a valuable property in the era of shrink-wrapped mass market software.
3.6 Hardware-Based Speculation
As we try to exploit more instruction-level parallelism, maintaining control dependences
becomes an increasing burden. Branch prediction reduces the direct stalls attributable to
branches, but for a processor executing multiple instructions per clock, just predicting
branches accurately may not be sufficient to generate the desired amount of instruction-level
parallelism. A wide-issue processor may need to execute a branch every clock cycle to maintain maximum performance.