change was the most significant change in processor architecture since the early days of
pipelining and instruction-level parallelism some 25+ years earlier.
During the same period, designers began to explore the use of more data-level parallelism as
another approach to obtaining performance. SIMD extensions enabled desktop and server
microprocessors to achieve moderate performance increases for graphics and similar functions.
More importantly, graphics processing units (GPUs) pursued aggressive use of SIMD,
achieving significant performance advantages for applications with extensive data-level parallelism.
For scientific applications, such approaches represent a viable alternative to the more general,
but less efficient, thread-level parallelism exploited in multicores. The next chapter explores
these developments in the use of data-level parallelism.
Many researchers predicted a major retrenchment in the use of ILP, expecting that two-issue
superscalar processors and larger numbers of cores would be the future. The advantages,
however, of slightly higher issue rates and the ability of speculative dynamic scheduling to
deal with unpredictable events, such as level-one cache misses, led to moderate ILP being
the primary building block in multicore designs. The addition of SMT and its effectiveness
(both for performance and energy efficiency) further cemented the position of the
moderate-issue, out-of-order, speculative approaches. Indeed, even in the embedded market, the newest
processors (e.g., the ARM Cortex-A9) have introduced dynamic scheduling, speculation, and
wider issue rates.
It is highly unlikely that future processors will try to increase the width of issue
significantly. It is simply too inefficient from the viewpoint of both silicon utilization and power
efficiency. Consider the data in Figure 3.47 that show the most recent four processors in the IBM
Power series. Over the past decade, there has been a modest improvement in the ILP support
in the Power processors, but the dominant portion of the increase in transistor count (a factor
of almost 7 from the Power4 to the Power7) went to increasing the caches and the number of
cores per die. Even the expansion in SMT support seems to be more a focus than an increase
in the ILP throughput: The ILP structure from Power4 to Power7 went from 5 issues to 6, from
8 functional units to 12 (but not increasing from the original 2 load/store units), while the SMT
support went from nonexistent to 4 threads/processor. It seems clear that even for the most
advanced ILP processor in 2011 (the Power7), the focus has moved beyond instruction-level
parallelism. The next two chapters focus on approaches that exploit data-level and thread-level
parallelism.
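To make the Power4-to-Power7 comparison concrete, the growth factors implied by the figures quoted above can be tabulated with a few lines of Python. This is only an illustrative sketch: the dictionary keys and the function name are made up for this example, "nonexistent" SMT is assumed to mean one hardware thread per processor, and the transistor growth is rounded to 7 from the "almost 7" stated in the text.

```python
# Per-processor ILP and SMT figures quoted in the text (Power4 vs. Power7).
# Assumption: "nonexistent" SMT support is modeled as 1 thread per processor.
power4 = {"issue_width": 5, "functional_units": 8,
          "load_store_units": 2, "threads_per_processor": 1}
power7 = {"issue_width": 6, "functional_units": 12,
          "load_store_units": 2, "threads_per_processor": 4}

# Assumption: the text says "a factor of almost 7"; 7.0 is used as a rounded bound.
TRANSISTOR_GROWTH_APPROX = 7.0


def growth_factors(old, new):
    """Return new/old for every resource listed in `old`."""
    return {name: new[name] / old[name] for name in old}


factors = growth_factors(power4, power7)

for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x")
print(f"transistor count (approx.): {TRANSISTOR_GROWTH_APPROX:.2f}x")
```

Under these assumptions, every ILP-related factor comes out well below the growth in transistor count, and the largest per-processor change is in thread support, which is consistent with the argument that the added transistors went mainly to caches, cores, and SMT rather than to wider issue.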