Hardware Reference
In-Depth Information
FIGURE 3.28 How four different approaches use the functional unit execution slots of a
superscalar processor. The horizontal dimension represents the instruction execution
capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An
empty (white) box indicates that the corresponding execution slot is unused in that clock
cycle. The shades of gray and black correspond to four different threads in the multithreading
processors. Black is also used to indicate the occupied issue slots in the case of the
superscalar without multithreading support. The Sun T1 and T2 (aka Niagara) processors are
fine-grained multithreaded processors, while the Intel Core i7 and IBM Power7 processors use
SMT. The T2 has eight threads per core, the Power7 has four, and the Intel i7 has two. In all
existing SMTs, instructions issue from only one thread at a time. The difference in SMT is that
the subsequent decision to execute an instruction is decoupled and could execute the operations
coming from several different instructions in the same clock cycle.
In the superscalar without multithreading support, the use of issue slots is limited by a lack
of ILP, including ILP to hide memory latency. Because of the length of L2 and L3 cache misses,
much of the processor can be left idle.
In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by
switching to another thread that uses the resources of the processor. This switching reduces
the number of completely idle clock cycles. In a coarse-grained multithreaded processor,
however, thread switching only occurs when there is a stall. Because the new thread has a
start-up period, there are likely to be some fully idle cycles remaining.
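The contrast between the two cases above can be sketched with a toy slot-occupancy model. This is only an illustration under stated assumptions, not a model of any real processor: the issue width, miss length, and start-up penalty (WIDTH, STALL, STARTUP) are made-up values, each thread is a hypothetical list of ready-instruction counts per cycle, and a 0 in that list marks a long cache miss.

```python
WIDTH = 4      # issue slots per cycle (assumed for illustration)
STALL = 6      # idle cycles caused by one long cache miss (assumed)
STARTUP = 2    # pipeline refill cycles after a thread switch (assumed)


def run_single(work):
    """Slots used per cycle by one thread; a 0 in `work` marks a long miss."""
    cycles = []
    for ready in work:
        if ready == 0:
            cycles.extend([0] * STALL)        # nothing issues during the miss
        else:
            cycles.append(min(ready, WIDTH))  # limited by ILP and issue width
    return cycles


def run_coarse(threads):
    """Coarse-grained multithreading: switch threads only on a long stall."""
    pcs = [0] * len(threads)        # next work item of each thread
    ready_at = [0] * len(threads)   # cycle at which each thread's miss resolves
    cycles, cur = [], 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        now = len(cycles)
        if pcs[cur] >= len(threads[cur]) or ready_at[cur] > now:
            # The current thread is finished or stalled: look for another one.
            runnable = [i for i, t in enumerate(threads)
                        if pcs[i] < len(t) and ready_at[i] <= now]
            if runnable:
                cur = runnable[0]
                cycles.extend([0] * STARTUP)  # start-up period of the new thread
            else:
                cycles.append(0)              # every remaining thread is stalled
            continue
        ready = threads[cur][pcs[cur]]
        pcs[cur] += 1
        if ready == 0:
            ready_at[cur] = now + STALL       # miss detected; thread must wait
        else:
            cycles.append(min(ready, WIDTH))
    return cycles


def idle_cycles(cycles):
    """Number of cycles in which no slot at all was used."""
    return sum(1 for used in cycles if used == 0)


if __name__ == "__main__":
    a = [2, 1, 0, 3, 2]
    b = [1, 2, 0, 2, 1]
    back_to_back = run_single(a) + run_single(b)
    switched = run_coarse([a, b])
    print(len(back_to_back), idle_cycles(back_to_back))
    print(len(switched), idle_cycles(switched))
```

With these assumed workloads, switching on a stall shortens the run and removes some of the fully idle cycles, but the start-up periods and the moments when both threads are waiting still leave idle cycles behind.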
In the fine-grained case, the interleaving of threads can eliminate fully empty slots. In
addition, because the issuing thread is changed on every clock cycle, longer latency operations
can be hidden. Because instruction issue and execution are connected, a thread can only issue
as many instructions as are ready. With a narrow issue width this is not a problem (a cycle is
either occupied or not), which is why fine-grained multithreading works perfectly for a
single-issue processor, where SMT would make no sense. Indeed, in the Sun T2, there are two
issues per clock, but they are from different threads. This eliminates the need to implement the
complex dynamic scheduling approach and relies instead on hiding latency with more threads.
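The fine-grained case can be sketched in the same toy style. The code below is a minimal illustration, not a description of the Sun T1 or T2 hardware: WIDTH and STALL are assumed values, each thread is a hypothetical list of ready-instruction counts, a 0 marks a long miss, and the round-robin policy that skips stalled threads at no cost is an assumption of the sketch.

```python
WIDTH = 4   # issue slots per cycle (assumed for illustration)
STALL = 3   # cycles a thread is unavailable after a long miss (assumed)


def run_fine(threads):
    """Fine-grained multithreading: one thread issues per cycle, round-robin.

    Returns a list of (thread_id or None, slots_used), one entry per cycle.
    """
    count = len(threads)
    pcs = [0] * count        # next work item of each thread
    ready_at = [0] * count   # cycle at which each thread's miss resolves
    cycles, nxt = [], 0
    while any(pc < len(t) for pc, t in zip(pcs, threads)):
        now = len(cycles)
        issued = (None, 0)
        for k in range(count):
            i = (nxt + k) % count
            if pcs[i] >= len(threads[i]) or ready_at[i] > now:
                continue                     # finished or stalled: skip it
            ready = threads[i][pcs[i]]
            pcs[i] += 1
            if ready == 0:
                ready_at[i] = now + STALL    # miss: try the next thread instead
                continue
            # Only this one thread issues, so it fills at most `ready` slots.
            issued = (i, min(ready, WIDTH))
            nxt = (i + 1) % count
            break
        cycles.append(issued)
    return cycles


if __name__ == "__main__":
    threads = [
        [2, 0, 1, 3],
        [1, 2, 0, 2],
        [3, 1, 1, 2],
        [1, 1, 2, 1],
    ]
    for cycle, (tid, used) in enumerate(run_fine(threads)):
        row = "#" * used + "." * (WIDTH - used)
        print(cycle, tid, row)
```

For these assumed workloads, every cycle has at least one occupied slot because another thread is always ready when one misses. No cycle is full, however, since the single issuing thread never has enough ready instructions to fill the whole width.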
If one implements fine-grained threading on top of a multiple-issue dynamically scheduled
processor, the result is SMT. In all existing SMT implementations, all issues come from one
thread, although instructions from different threads can initiate execution in the same cycle,
 