Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach


change was the most significant change in processor architecture since the early days of

pipelining and instruction-level parallelism some 25+ years earlier.

During the same period, designers began to explore the use of more data-level parallelism as

another approach to obtaining performance. SIMD extensions enabled desktop and server

microprocessors to achieve moderate performance increases for graphics and similar functions.

More importantly, graphics processing units (GPUs) pursued aggressive use of SIMD,

achieving significant performance advantages for applications with extensive data-level parallelism.

For scientific applications, such approaches represent a viable alternative to the more general,

but less efficient, thread-level parallelism exploited in multicores. The next chapter explores

these developments in the use of data-level parallelism.

Many researchers predicted a major retrenchment in the use of ILP, expecting that two-issue

superscalar processors and larger numbers of cores would be the future. The advantages,

however, of slightly higher issue rates and the ability of speculative dynamic scheduling to

deal with unpredictable events, such as level-one cache misses, led to moderate ILP being

the primary building block in multicore designs. The addition of SMT and its effectiveness

(both for performance and energy efficiency) further cemented the position of the

moderate-issue, out-of-order, speculative approaches. Indeed, even in the embedded market, the newest

processors (e.g., the ARM Cortex-A9) have introduced dynamic scheduling, speculation, and

wider issue rates.

It is highly unlikely that future processors will try to increase the width of issue

significantly. It is simply too inefficient both from the viewpoint of silicon utilization and power

efficiency. Consider the data in Figure 3.47 that show the most recent four processors in the IBM

Power series. Over the past decade, there has been a modest improvement in the ILP support

in the Power processors, but the dominant portion of the increase in transistor count (a factor

of almost 7 from the Power4 to the Power7) went to increasing the caches and the number of

cores per die. Even the expansion in SMT support seems to have been more of a focus than an increase

in ILP throughput: The ILP structure from Power4 to Power7 went from 5 issues to 6 and from

8 functional units to 12 (with no increase over the original 2 load/store units), while the SMT

support went from nonexistent to 4 threads/processor. It seems clear that even for the most

advanced ILP processor in 2011 (the Power7), the focus has moved beyond instruction-level

parallelism. The next two chapters focus on approaches that exploit data-level and thread-level

parallelism.
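
The Power4-to-Power7 comparison above can be made concrete with a little arithmetic. The sketch below uses only the figures quoted in the text. Two things in it are assumptions rather than data from Figure 3.47: the transistor growth is rounded to 7.0 (the text says "almost 7"), and "nonexistent" SMT is interpreted as one hardware thread per processor. The dictionary keys and the helper name growth_factors are illustrative.

```python
# Growth of ILP-related resources from Power4 to Power7, using only the
# numbers quoted in the text.
# Assumption: no SMT on the Power4 is modeled as 1 hardware thread.
power4 = {
    "issue_width": 5,
    "functional_units": 8,
    "load_store_units": 2,
    "threads_per_processor": 1,
}
power7 = {
    "issue_width": 6,
    "functional_units": 12,
    "load_store_units": 2,
    "threads_per_processor": 4,
}

# Assumption: the text says "a factor of almost 7"; rounded to 7.0 here.
TRANSISTOR_GROWTH = 7.0


def growth_factors(old, new):
    """Return new/old for every resource present in `old`."""
    return {name: new[name] / old[name] for name in old}


factors = growth_factors(power4, power7)
for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x (transistors: ~{TRANSISTOR_GROWTH:.0f}x)")
```

Under these assumptions, issue width grows by a factor of 1.2 and functional units by 1.5, the load/store units do not grow at all, and thread support grows by 4, all well below the roughly sevenfold growth in transistors. That gap is the point of the paragraph: most of the added silicon went to caches and cores rather than to wider issue.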
