Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach


control variable is incremented on every loop iteration, the loop contains at least one dependence. As we show in Appendix H, loop unrolling and aggressive algebraic optimization can remove such dependent computation. Wall's study includes a limited amount of such optimizations, but applying them more aggressively could lead to increased amounts of ILP. In addition, certain code generation conventions introduce unneeded dependences, in particular the use of return address registers and a register for the stack pointer (which is incremented and decremented in the call/return sequence). Wall removes the effect of the return address register, but the use of a stack pointer in the linkage convention can cause “unnecessary” dependences. Postiff et al. [1999] explored the advantages of removing this constraint.
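As a rough illustration of how unrolling shortens the dependence chain through the loop control variable, the following Python sketch sums an array two ways and counts the serial updates to `i`. The function names and the `updates` counter are illustrative assumptions, not taken from Wall's study or Appendix H, and the four partial sums are one possible algebraic reassociation.

```python
def sum_rolled(a):
    """Baseline loop: one update of the control variable per element."""
    total, i, updates = 0, 0, 0
    while i < len(a):
        total += a[i]
        i += 1            # each update depends on the previous one
        updates += 1
    return total, updates


def sum_unrolled4(a):
    """Same sum, unrolled by 4 with i+1, i+2, i+3 folded into offsets."""
    n = len(a)
    s0 = s1 = s2 = s3 = 0  # independent partial sums (reassociation)
    i, updates = 0, 0
    while i + 4 <= n:
        s0 += a[i]
        s1 += a[i + 1]
        s2 += a[i + 2]
        s3 += a[i + 3]
        i += 4            # single dependent update per four elements
        updates += 1
    total = s0 + s1 + s2 + s3
    while i < n:          # cleanup loop for leftover elements
        total += a[i]
        i += 1
        updates += 1
    return total, updates
```

Both functions return the same sum, but the unrolled version performs roughly one quarter as many dependent updates to `i`, and its four accumulators do not depend on one another within an iteration.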

3. Overcoming the data flow limit: If value prediction worked with high accuracy, it could overcome the data flow limit. As of yet, none of the more than 100 papers on the subject has achieved a significant enhancement in ILP when using a realistic prediction scheme. Obviously, perfect data value prediction would lead to effectively infinite parallelism, since every value of every instruction could be predicted a priori.
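To make the idea of value prediction concrete, here is a minimal sketch of a last-value predictor, one of the simplest schemes: it guesses that an instruction will produce the same result it produced last time. The trace format (pairs of instruction address and result) and the function name are assumptions made for illustration; this is not the scheme evaluated in any particular study.

```python
def last_value_accuracy(trace):
    """Fraction of results a last-value predictor guesses correctly.

    trace: list of (pc, value) pairs, one per executed instruction,
    where value is the integer result that instruction produced.
    """
    last_seen = {}   # pc -> most recent result of that instruction
    correct = 0
    for pc, value in trace:
        # The prediction is the previous result; the first sighting
        # of a pc has no entry, so it counts as a miss.
        if last_seen.get(pc) == value:
            correct += 1
        last_seen[pc] = value
    return correct / len(trace) if trace else 0.0
```

An instruction that keeps producing the same value is predicted well after its first execution, whereas one that produces a different value every time (such as a counter increment) is never predicted correctly by this scheme.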

For a less-than-perfect processor, several ideas have been proposed that could expose more ILP. One example is to speculate along multiple paths. This idea was discussed by Lam and Wilson [1992] and explored in the study covered in this section. By speculating on multiple paths, the cost of incorrect recovery is reduced and more parallelism can be uncovered. It only makes sense to evaluate this scheme for a limited number of branches because the hardware resources required grow exponentially. Wall [1993] provided data for speculating in both directions on up to eight branches. Given the costs of pursuing both paths, knowing that one will be thrown away (and the growing amount of useless computation as such a process is followed through multiple branches), every commercial design has instead devoted additional hardware to better speculation on the correct path.
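The exponential growth can be seen with a small calculation. The sketch below assumes an idealized model in which every unresolved branch is followed in both directions, so the speculated paths form a full binary tree; the function names are illustrative.

```python
def speculative_paths(unresolved_branches: int) -> int:
    """Paths in flight when each unresolved branch is forked both ways."""
    return 2 ** unresolved_branches


def discarded_paths(unresolved_branches: int) -> int:
    """Paths that are thrown away: all but the one correct path."""
    return speculative_paths(unresolved_branches) - 1
```

Under this model, speculating in both directions on eight branches keeps 256 paths in flight, of which 255 are eventually discarded.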

It is critical to understand that none of the limits in this section is fundamental in the sense that overcoming them requires a change in the laws of physics! Instead, they are practical limitations that imply the existence of some formidable barriers to exploiting additional ILP. These limitations (whether they be window size, alias detection, or branch prediction) represent challenges for designers and researchers to overcome.

Attempts to break through these limits in the first five years of this century met with frustration. Some techniques produced small improvements, but often at significant increases in complexity, increases in the clock cycle, and disproportionate increases in power. In summary, designers discovered that trying to extract more ILP was simply too inefficient. We will return to this discussion in our concluding remarks.

3.11 Cross-Cutting Issues: ILP Approaches and the Memory System

Hardware Versus Software Speculation

The hardware-intensive approaches to speculation in this chapter and the software approaches of Appendix H provide alternative approaches to exploiting ILP. Some of the trade-offs, and the limitations, for these approaches are listed below:

■ To speculate extensively, we must be able to disambiguate memory references. This capability is difficult to achieve at compile time for integer programs that contain pointers. In a hardware-based scheme, dynamic runtime disambiguation of memory addresses is done using the techniques we saw earlier for Tomasulo's algorithm. This disambiguation allows
