In addition, the hardware must support the ability to change to a different thread relatively quickly;
in particular, a thread switch should be much more efficient than a process switch, which typ-
ically requires hundreds to thousands of processor cycles. Of course, for multithreading hard-
ware to achieve performance improvements, a program must contain multiple threads (we
sometimes say that the application is multithreaded) that could execute in concurrent fashion.
These threads are identified either by a compiler (typically from a language with parallelism
constructs) or by the programmer.
There are three main hardware approaches to multithreading. Fine-grained multithreading
switches between threads on each clock, causing the execution of instructions from multiple
threads to be interleaved. This interleaving is often done in a round-robin fashion, skipping
any threads that are stalled at that time. One key advantage of fine-grained multithreading is
that it can hide the throughput losses that arise from both short and long stalls, since instruc-
tions from other threads can be executed when one thread stalls, even if the stall is only for a
few cycles. The primary disadvantage of fine-grained multithreading is that it slows down the
execution of an individual thread, since a thread that is ready to execute without stalls will be
delayed by instructions from other threads. It trades an increase in multithreaded throughput
for a loss in the performance (as measured by latency) of a single thread. The Sun Niagara pro-
cessor, which we examine shortly, uses simple fine-grained multithreading, as do the Nvidia
GPUs, which we look at in the next chapter.
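To make the selection policy concrete, the short Python sketch below (not from the text; the thread count and the stall test are hypothetical) shows a round-robin selector that picks the next unstalled thread each cycle, issuing a bubble only if every thread is stalled.

    # A minimal sketch of fine-grained thread selection: round-robin over the
    # hardware thread contexts, skipping any thread that is stalled this cycle.
    def select_thread(num_threads, last_issued, stalled):
        """Return the next thread to issue from, or None if all are stalled.

        num_threads -- number of hardware thread contexts
        last_issued -- index of the thread that issued last cycle
        stalled     -- function mapping thread index -> True if it cannot issue
        """
        for offset in range(1, num_threads + 1):
            candidate = (last_issued + offset) % num_threads
            if not stalled(candidate):
                return candidate
        return None  # every thread is stalled: the pipeline issues a bubble

    # Example: 4 thread contexts, thread 2 is waiting on a cache miss.
    stalls = {2}
    print(select_thread(4, last_issued=1, stalled=lambda t: t in stalls))  # -> 3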
Coarse-grained multithreading was invented as an alternative to fine-grained multithreading.
Coarse-grained multithreading switches threads only on costly stalls, such as level two or
three cache misses. This change relieves the need to have thread-switching be essentially free
and is much less likely to slow down the execution of any one thread, since instructions from
other threads will only be issued when a thread encounters a costly stall.
Coarse-grained multithreading suffers, however, from a major drawback: It is limited in
its ability to overcome throughput losses, especially from shorter stalls. This limitation arises
from the pipeline start-up costs of coarse-grained multithreading. Because a CPU with coarse-
grained multithreading issues instructions from a single thread, when a stall occurs the
pipeline will see a bubble before the new thread begins executing. Because of this start-up
overhead, coarse-grained multithreading is much more useful for reducing the penalty of very
high-cost stalls, where pipeline refill is negligible compared to the stall time. Several research
projects have explored coarse-grained multithreading, but no major current processors use this
technique.
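A back-of-the-envelope comparison makes the trade-off clear. The Python sketch below (with an assumed pipeline refill cost; the numbers are illustrative, not from the text) contrasts the cycles lost by waiting out a stall against the fixed refill bubble paid when switching threads; switching only wins when the stall is long.

    # Rough cycle accounting for coarse-grained multithreading: switching
    # threads empties the pipeline, so the refill bubble must be small
    # relative to the stall being hidden.
    PIPELINE_DEPTH = 8  # hypothetical refill cost in cycles after a thread switch

    def cycles_lost(stall_cycles, switch_threads):
        """Cycles wasted if we wait out the stall vs. switch to another thread."""
        if switch_threads:
            return PIPELINE_DEPTH   # bubble while the new thread refills the pipeline
        return stall_cycles         # idle cycles if we simply wait

    for stall in (3, 20, 200):      # short stall, medium stall, miss to memory
        wait = cycles_lost(stall, switch_threads=False)
        switch = cycles_lost(stall, switch_threads=True)
        print(f"stall={stall:3d} cycles: wait loses {wait}, switch loses {switch}")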
The most common implementation of multithreading is called simultaneous multithreading
(SMT). Simultaneous multithreading is a variation on fine-grained multithreading that arises
naturally when fine-grained multithreading is implemented on top of a multiple-issue, dy-
namically scheduled processor. As with other forms of multithreading, SMT uses thread-level
parallelism to hide long-latency events in a processor, thereby increasing the usage of the func-
tional units. The key insight in SMT is that register renaming and dynamic scheduling allow
multiple instructions from independent threads to be executed without regard to the depend-
ences among them; the resolution of the dependences can be handled by the dynamic schedul-
ing capability.
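The following toy Python sketch (an illustration of the idea, not an actual SMT issue stage; the issue width and ready queues are assumptions) shows how a multiple-issue core can fill its issue slots in a single cycle with renamed, ready instructions drawn from several independent threads.

    # A toy SMT-style issue cycle: fill up to ISSUE_WIDTH slots from whatever
    # threads happen to have ready (already renamed) instructions.
    ISSUE_WIDTH = 4

    def issue_cycle(ready_queues):
        """ready_queues -- dict: thread id -> list of ready instructions."""
        slots = []
        thread_ids = list(ready_queues)
        i = 0
        while len(slots) < ISSUE_WIDTH and any(ready_queues.values()):
            tid = thread_ids[i % len(thread_ids)]
            if ready_queues[tid]:
                slots.append((tid, ready_queues[tid].pop(0)))
            i += 1
        return slots

    # Thread 0 has only one ready instruction; threads 1 and 2 fill the rest.
    queues = {0: ["add"], 1: ["load", "mul"], 2: ["sub", "or"]}
    print(issue_cycle(queues))  # [(0, 'add'), (1, 'load'), (2, 'sub'), (1, 'mul')]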
Figure 3.28 conceptually illustrates the differences in a processor's ability to exploit the re-
sources of a superscalar for the following processor configurations:
■ A superscalar with no multithreading support
■ A superscalar with coarse-grained multithreading
■ A superscalar with fine-grained multithreading
■ A superscalar with simultaneous multithreading
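As a rough stand-in for the figure, the short Python snippet below prints a made-up occupancy pattern for four issue slots per cycle under each configuration (letters are threads, dots are wasted slots); the patterns are purely illustrative and are not taken from Figure 3.28.

    # Schematic issue-slot occupancy over five cycles for each configuration.
    patterns = {
        "no multithreading":  ["AA..", "A...", "....", "AAA.", "A..."],
        "coarse-grained MT":  ["AA..", "A...", "BB..", "B...", "BBB."],
        "fine-grained MT":    ["AA..", "BBB.", "AAA.", "B...", "AA.."],
        "simultaneous MT":    ["AABB", "ABBB", "AAAB", "ABB.", "AABB"],
    }

    for name, cycles in patterns.items():
        print(f"{name:20s} " + " ".join(cycles))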