not the only way. The performance gains from the 80386 through the 80486, Pen-
tium, and later designs like the Core i7 are due to better implementations, as the ar-
chitecture has remained essentially the same through all of them.
Some kinds of improvements can be made only by changing the architecture.
Sometimes these changes are incremental, such as adding new instructions or reg-
isters, so that old programs will continue to run on the new models. In this case, to
get the full performance, the software must be changed, or at least recompiled with
a new compiler that takes advantage of the new features.
However, once every few decades, designers realize that the old architecture has
outlived its usefulness and that the only way to make progress is to start all over
again. The RISC revolution in the 1980s was one such breakthrough; another one
is in the air now. We will look at one example (the Intel IA-64) in Chap. 5.
In the rest of this section we will look at four different techniques for im-
proving CPU performance. We will start with three well-established implemen-
tation improvements and then move on to one that needs a little architectural sup-
port to work best. These techniques are cache memory, branch prediction, out-of-
order execution with register renaming, and speculative execution.
4.5.1 Cache Memory
One of the most challenging aspects of computer design throughout history has
been to provide a memory system able to supply operands to the processor at the
speed it can process them. The recent high rate of growth in processor speed has
not been accompanied by a corresponding speedup in memories. Relative to
CPUs, memories have been getting slower for decades. Given the enormous
importance of primary memory, this situation has greatly limited the development
of high-performance systems and has stimulated research on ways to get around
the problem of memory speeds that are much slower than CPU speeds and, rel-
atively speaking, getting worse every year.
Modern processors place overwhelming demands on a memory system, in
terms of both latency (the delay in supplying an operand) and bandwidth (the
amount of data supplied per unit of time). Unfortunately, these two aspects of a
memory system are largely at odds. Many techniques for increasing bandwidth do
so only by increasing latency. For example, the pipelining techniques used in the
Mic-3 can be applied to a memory system, with multiple, overlapping memory re-
quests handled efficiently. Unfortunately, as with the Mic-3, this results in greater
latency for individual memory operations. As processor clock speeds get faster, it
becomes more and more difficult to provide a memory system capable of supply-
ing operands in one or two clock cycles.
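The latency/bandwidth tension described above can be made concrete with a small back-of-the-envelope calculation. The cycle counts below are illustrative assumptions, not measurements of the Mic-3 or any real memory system: a plain memory takes 4 cycles per access, and the pipelined version splits the access into 4 overlapped stages, each paying 1 extra cycle of latching overhead.

```python
# Illustrative (assumed) timings -- not taken from any real machine.
ACCESS_CYCLES = 4        # one complete, unpipelined memory access
PIPELINE_STAGES = 4      # the access split into 4 overlapped stages
STAGE_OVERHEAD = 1       # assumed extra cycle of latch overhead per stage

# Unpipelined memory: one request completes every ACCESS_CYCLES cycles.
unpiped_latency = ACCESS_CYCLES            # cycles until the operand arrives
unpiped_bandwidth = 1 / ACCESS_CYCLES      # requests completed per cycle

# Pipelined memory: once the pipe is full, one request completes per cycle,
# but every request now traverses all the (slightly slower) stages.
piped_latency = PIPELINE_STAGES * (ACCESS_CYCLES // PIPELINE_STAGES + STAGE_OVERHEAD)
piped_bandwidth = 1.0                      # one completion per cycle, pipe full

print(f"unpipelined: latency={unpiped_latency} cycles, "
      f"bandwidth={unpiped_bandwidth:.2f} req/cycle")
print(f"pipelined:   latency={piped_latency} cycles, "
      f"bandwidth={piped_bandwidth:.2f} req/cycle")
```

With these numbers, pipelining quadruples the bandwidth but doubles the latency of each individual request, which is exactly the tradeoff the text describes.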
One way to attack this problem is by providing caches. As we saw in Sec.
2.2.5, a cache holds the most recently used memory words in a small, fast memory,
speeding up access to them. If a large enough percentage of the memory words
needed are in the cache, the effective memory latency can be reduced enormously.
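How strongly the hit rate drives the effective latency can be sketched with the standard weighted-average formula (the cycle counts here are illustrative assumptions, not figures for any particular machine): with hit rate h, cache access time c, and main-memory access time m, the average latency is h * c + (1 - h) * m.

```python
def effective_latency(hit_rate, cache_cycles, memory_cycles):
    """Average latency: hits are served by the cache, misses by main memory."""
    return hit_rate * cache_cycles + (1 - hit_rate) * memory_cycles

# Assumed timings: 1-cycle cache, 20-cycle main memory.
for h in (0.80, 0.95, 0.99):
    avg = effective_latency(h, 1, 20)
    print(f"hit rate {h:.0%}: {avg:.2f} cycles on average")
```

Note how nonlinear the payoff is: going from a 95% to a 99% hit rate cuts the average latency from 1.95 to 1.19 cycles, so the last few percent of hit rate matter enormously.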