using the dynamic scheduling hardware to determine what instructions are ready. Although
Figure 3.28 greatly simplifies the real operation of these processors, it does illustrate the
potential performance advantages of multithreading in general and SMT in wider-issue,
dynamically scheduled processors.
Simultaneous multithreading uses the insight that a dynamically scheduled processor
already has many of the hardware mechanisms needed to support multithreading, including
a large virtual register set. Multithreading can be built on top of an out-of-order processor by
adding a per-thread renaming table, keeping separate PCs, and providing the capability for
instructions from multiple threads to commit.
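The per-thread additions listed above can be illustrated with a minimal sketch. It assumes a toy core with a shared pool of physical registers and one renaming table and PC per thread. The names ThreadContext, SharedCore, rename_dest, and lookup_src, and the register counts, are hypothetical and chosen only for illustration; the sketch omits freeing physical registers at commit and does not model any real processor.

```python
from dataclasses import dataclass, field

NUM_PHYS_REGS = 16  # assumed size of the shared physical register pool


@dataclass
class ThreadContext:
    """Per-thread state added on top of the shared out-of-order core."""
    pc: int = 0                                        # separate PC per thread
    rename_table: dict = field(default_factory=dict)   # arch reg -> phys reg


class SharedCore:
    """Toy model: threads share physical registers but not rename tables."""

    def __init__(self, num_threads):
        self.free_list = list(range(NUM_PHYS_REGS))
        self.threads = [ThreadContext() for _ in range(num_threads)]

    def rename_dest(self, tid, arch_reg):
        """Map a thread's destination register to a fresh physical register."""
        phys = self.free_list.pop(0)
        self.threads[tid].rename_table[arch_reg] = phys
        return phys

    def lookup_src(self, tid, arch_reg):
        """Find the physical register holding a thread's source operand."""
        return self.threads[tid].rename_table[arch_reg]


core = SharedCore(num_threads=2)
p0 = core.rename_dest(0, arch_reg=1)  # thread 0 writes its r1
p1 = core.rename_dest(1, arch_reg=1)  # thread 1 writes its r1
```

Because each thread has its own table, the same architectural register number in two threads maps to two different physical registers, so instructions from both threads can be in flight in the shared hardware without interfering.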
Effectiveness Of Fine-Grained Multithreading On The Sun T1
In this section, we use the Sun T1 processor to examine the ability of multithreading to hide
latency. The T1 is a fine-grained multithreaded multicore microprocessor introduced by Sun
in 2005. What makes T1 especially interesting is that it is almost totally focused on exploiting
thread-level parallelism (TLP) rather than instruction-level parallelism (ILP). The T1 abandoned
the intense focus on ILP (shortly after the most aggressive ILP processors ever were
introduced), returned to a simple pipeline strategy, and focused on exploiting TLP, using both
multiple cores and multithreading to produce throughput.
Each T1 processor contains eight processor cores, each supporting four threads. Each
processor core consists of a simple six-stage, single-issue pipeline (a standard five-stage RISC
pipeline like that of Appendix C, with one stage added for thread switching). T1 uses
fine-grained multithreading (but not SMT), switching to a new thread on each clock cycle;
threads that are idle because they are waiting on a pipeline delay or cache miss are
bypassed in the scheduling. The processor is idle only when all four threads are idle or stalled.
Both loads and branches incur a three-cycle delay that can only be hidden by other threads. A
single set of floating-point functional units is shared by all eight cores, as floating-point
performance was not a focus for T1. Figure 3.29 summarizes the T1 processor.
FIGURE 3.29 A summary of the T1 processor.
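The thread-selection policy described above can be sketched as a small simulation of one single-issue core. This is an illustrative model under stated assumptions, not a description of the actual T1 hardware: threads are picked in round-robin order, a stalled thread is skipped, and, as a worst case, every instruction is treated as a load or branch that keeps its thread from issuing for the next three cycles. The function name simulate_core and its parameters are hypothetical.

```python
def simulate_core(num_threads, cycles, delay=3):
    """Return the fraction of cycles in which the core issues an instruction.

    Assumed worst case: every instruction is a load or branch, so after
    issuing, a thread cannot issue again for `delay` further cycles.
    """
    ready_at = [0] * num_threads   # first cycle at which each thread may issue
    last = num_threads - 1         # thread that issued most recently
    busy = 0
    for cycle in range(cycles):
        # Round-robin search starting after the last issuing thread,
        # bypassing any thread that is still stalled.
        for offset in range(1, num_threads + 1):
            tid = (last + offset) % num_threads
            if ready_at[tid] <= cycle:
                ready_at[tid] = cycle + 1 + delay
                busy += 1
                last = tid
                break
        # If no thread was ready, the core is idle for this cycle.
    return busy / cycles


one_thread = simulate_core(num_threads=1, cycles=40)
two_threads = simulate_core(num_threads=2, cycles=40)
four_threads = simulate_core(num_threads=4, cycles=40)
```

Under these assumptions a single thread keeps the pipeline busy only one cycle in four, whereas four threads can cover one another's three-cycle delays completely. Real workloads also suffer cache misses, which last far longer than three cycles and are not modeled here.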
 