Pitfall: Sometimes Bigger and Dumber Is Better
Much of the attention in the early 2000s went to building aggressive processors to exploit ILP, including the Pentium 4 architecture, which used the deepest pipeline ever seen in a microprocessor, and the Intel Itanium, which had the highest peak issue rate per clock ever seen. What quickly became clear was that the main limitation in exploiting ILP often turned out to be the memory system. Although speculative out-of-order pipelines were fairly good at hiding a significant fraction of the 10- to 15-cycle miss penalties for a first-level miss, they could do very little to hide the penalties for a second-level miss that, when going to main memory, were likely to be 50 to 100 clock cycles.
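A rough calculation illustrates why; the issue width and buffer size used here are illustrative assumptions rather than figures from these designs. To keep issuing across a miss of latency L cycles at a sustained rate of w instructions per clock, the hardware must find on the order of

\[
\text{independent instructions in flight} \approx w \times L
\]

At w = 3, a 15-cycle first-level miss calls for about 45 instructions, within the reach of a reorder buffer of roughly 100 entries; a 100-cycle access to main memory calls for about 300, beyond both the buffering these processors provided and the independent instructions that typical programs expose.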
The result was that these designs never came close to achieving the peak instruction
throughput despite the large transistor counts and extremely sophisticated and clever tech-
niques. The next section discusses this dilemma and the turning away from more aggressive
ILP schemes to multicore, but there was another change that exemplifies this pitfall. Instead
of trying to hide even more memory latency with ILP, designers simply used the transistors
to build much larger caches. Both the Itanium 2 and the i7 use three-level caches compared to
the two-level cache of the Pentium 4, and the third-level caches are 9 MB and 8 MB compared
to the 2 MB second-level cache of the Pentium 4. Needless to say, building larger caches is a
lot easier than designing the 20+-stage Pentium 4 pipeline and, from the data in Figure 3.46,
seems to be more effective.
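The familiar average memory access time relation makes the payoff of this choice concrete; the miss rates plugged in below are purely illustrative, not measurements of these processors:

\[
\text{Average memory access time} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
\]

If a larger last-level cache cuts the global miss rate to main memory from 2% to 1% (illustrative figures) against a 100-cycle penalty, the memory-stall contribution drops from 2.0 to 1.0 cycles per access, a gain obtained with far less design effort than another increment of pipeline sophistication.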
3.15 Concluding Remarks: What's Ahead?
As 2000 began, the focus on exploiting instruction-level parallelism was at its peak. Intel was
about to introduce Itanium, a high-issue-rate statically scheduled processor that relied on a
VLIW-like approach with intensive compiler support. MIPS, Alpha, and IBM processors with
dynamically scheduled speculative execution were in their second generation and had gotten
wider and faster. The Pentium 4, which used speculative scheduling, had also been announced
that year with seven functional units and a pipeline more than 20 stages deep. But there were
storm clouds on the horizon.
Research such as that covered in Section 3.10 was showing that pushing ILP much further
would be extremely difficult, and, while peak instruction throughput rates had risen from the
first speculative processors some 3 to 5 years earlier, sustained instruction execution rates were
growing much more slowly.
The next five years were telling. The Itanium turned out to be a good FP processor but only a
mediocre integer processor. Intel still produces the line, but there are not many users, the clock
rate lags the mainline Intel processors, and Microsoft no longer supports the instruction set.
The Intel Pentium 4, while achieving good performance, turned out to be inefficient in terms
of performance/watt (i.e., energy use), and the complexity of the processor made it unlikely
that further advances would be possible by increasing the issue rate. The end of a 20-year road
of achieving new performance levels in microprocessors by exploiting ILP had come. The Pen-
tium 4 was widely acknowledged to have gone beyond the point of diminishing returns, and
the aggressive and sophisticated Netburst microarchitecture was abandoned.
By 2005, Intel and all the other major processor manufacturers had revamped their approach
to focus on multicore. Higher performance would be achieved through thread-level parallel-
ism rather than instruction-level parallelism, and the responsibility for using the processor
efficiently would largely shift from the hardware to the software and the programmer. This