Pitfall: Sometimes Bigger and Dumber Is Better
Much of the attention in the early 2000s went to building aggressive processors to exploit ILP, including the Pentium 4 architecture, which used the deepest pipeline ever seen in a microprocessor, and the Intel Itanium, which had the highest peak issue rate per clock ever seen. What quickly became clear was that the main limitation in exploiting ILP often turned out to be the memory system. Although speculative out-of-order pipelines were fairly good at hiding a significant fraction of the 10- to 15-cycle miss penalties for a first-level miss, they could do very little to hide the penalties for a second-level miss that, when going to main memory, were likely to be 50 to 100 clock cycles.
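A rough calculation illustrates why; the issue width and buffer size used here are illustrative assumptions rather than figures from these designs. To keep issuing across a miss of latency L cycles at a sustained rate of w instructions per clock, the hardware must find on the order of

\[
\text{independent instructions in flight} \approx w \times L
\]

At w = 3, a 15-cycle first-level miss calls for about 45 instructions, within the reach of a reorder buffer of roughly 100 entries; a 100-cycle access to main memory calls for about 300, beyond both the buffering these processors provided and the independent instructions that typical programs expose.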
The result was that these designs never came close to achieving the peak instruction
throughput despite the large transistor counts and extremely sophisticated and clever tech-
niques. The next section discusses this dilemma and the turning away from more aggressive
ILP schemes to multicore, but there was another change that exemplifies this pitfall. Instead
of trying to hide even more memory latency with ILP, designers simply used the transistors
to build much larger caches. Both the Itanium 2 and the i7 use three-level caches compared to
the two-level cache of the Pentium 4, and the third-level caches are 9 MB and 8 MB compared
to the 2 MB second-level cache of the Pentium 4. Needless to say, building larger caches is a
lot easier than designing the 20+-stage Pentium 4 pipeline and, from the data in Figure 3.46,
seems to be more effective.
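The familiar average memory access time relation makes the payoff of this choice concrete; the miss rates plugged in below are purely illustrative, not measurements of these processors:

\[
\text{Average memory access time} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
\]

If a larger last-level cache cuts the global miss rate to main memory from 2% to 1% (illustrative figures) against a 100-cycle penalty, the memory-stall contribution drops from 2.0 to 1.0 cycles per access, a gain obtained with far less design effort than another increment of pipeline sophistication.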
3.15 Concluding Remarks: What's Ahead?
As 2000 began, the focus on exploiting instruction-level parallelism was at its peak. Intel was
about to introduce Itanium, a high-issue-rate statically scheduled processor that relied on a
VLIW-like approach with intensive compiler support. MIPS, Alpha, and IBM processors with
dynamically scheduled speculative execution were in their second generation and had gotten
wider and faster. The Pentium 4, which used speculative scheduling, had also been announced
that year with seven functional units and a pipeline more than 20 stages deep. But there were
storm clouds on the horizon.
Research such as that covered in Section 3.10 was showing that pushing ILP much further
would be extremely difficult, and, while peak instruction throughput rates had risen from the
first speculative processors some 3 to 5 years earlier, sustained instruction execution rates were
growing much more slowly.
The next five years were telling. The Itanium turned out to be a good FP processor but only a
mediocre integer processor. Intel still produces the line, but there are not many users, the clock
rate lags the mainline Intel processors, and Microsoft no longer supports the instruction set.
The Intel Pentium 4, while achieving good performance, turned out to be inefficient in terms
of performance/watt (i.e., energy use), and the complexity of the processor made it unlikely
that further advances would be possible by increasing the issue rate. The end of a 20-year road
of achieving new performance levels in microprocessors by exploiting ILP had come. The Pen-
tium 4 was widely acknowledged to have gone beyond the point of diminishing returns, and
the aggressive and sophisticated Netburst microarchitecture was abandoned.
By 2005, Intel and all the other major processor manufacturers had revamped their approach
to focus on multicore. Higher performance would be achieved through thread-level parallel-
ism rather than instruction-level parallelism, and the responsibility for using the processor
efficiently would largely shift from the hardware to the software and the programmer. This