Instruction-Level Parallelism and Its Exploitation - Computer Architecture: A Quantitative Approach

FIGURE 3.28 How four different approaches use the functional unit execution slots of a superscalar processor. The horizontal dimension represents the instruction execution capability in each clock cycle. The vertical dimension represents a sequence of clock cycles. An empty (white) box indicates that the corresponding execution slot is unused in that clock cycle. The shades of gray and black correspond to four different threads in the multithreading processors. Black is also used to indicate the occupied issue slots in the case of the superscalar without multithreading support. The Sun T1 and T2 (aka Niagara) processors are fine-grained multithreaded processors, while the Intel Core i7 and IBM Power7 processors use SMT. The T2 has eight threads, the Power7 has four, and the Intel i7 has two. In all existing SMTs, instructions issue from only one thread at a time. The difference in SMT is that the subsequent decision to execute an instruction is decoupled and could execute the operations coming from several different instructions in the same clock cycle.

In the superscalar without multithreading support, the use of issue slots is limited by a lack of ILP, including ILP to hide memory latency. Because of the length of L2 and L3 cache misses, much of the processor can be left idle.
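As a rough illustration of this point, the following Python sketch counts used and fully idle issue slots for a single thread on a superscalar. The function name slot_usage, the trace values, and the four-wide issue width are assumptions made for the example, not measurements of any real processor.

```python
def slot_usage(ready_per_cycle, issue_width):
    """Return (utilization, idle_cycles) for one thread on a superscalar.

    ready_per_cycle[i] is the assumed number of independent instructions
    the thread could issue in cycle i; 0 models a stall such as a cache miss.
    """
    # A cycle can never use more slots than the issue width provides.
    used = [min(ready, issue_width) for ready in ready_per_cycle]
    total_slots = issue_width * len(used)
    idle_cycles = sum(1 for u in used if u == 0)
    return sum(used) / total_slots, idle_cycles


# Hypothetical trace: short bursts of limited ILP around a four-cycle miss.
trace = [2, 1, 0, 0, 0, 0, 3, 1]
utilization, idle = slot_usage(trace, issue_width=4)
print(f"utilization={utilization:.2%}, fully idle cycles={idle}")
```

In this toy trace, the long miss leaves half of the cycles completely empty, and limited ILP leaves most of the remaining slots unused as well.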

In the coarse-grained multithreaded superscalar, the long stalls are partially hidden by switching to another thread that uses the resources of the processor. This switching reduces the number of completely idle clock cycles. In a coarse-grained multithreaded processor, however, thread switching only occurs when there is a stall. Because the new thread has a start-up period, there are likely to be some fully idle cycles remaining.
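The switching policy described above can be sketched as a toy model in Python. This is only a sketch under strong assumptions: the function name coarse_grained, the readiness table, and the one-cycle start-up penalty are hypothetical, and each thread's per-cycle readiness is fixed in advance rather than depending on when the thread actually runs.

```python
def coarse_grained(ready, issue_width, startup):
    """Return the number of issue slots used in each cycle.

    ready[t][c] is the assumed number of instructions thread t could
    issue in cycle c (0 = stalled). The processor stays on one thread
    until it stalls, then switches and pays `startup` fully idle cycles.
    """
    n_threads, n_cycles = len(ready), len(ready[0])
    current, warmup = 0, 0
    used = []
    for c in range(n_cycles):
        if warmup == 0 and ready[current][c] == 0:
            # Current thread is stalled: look for another thread that can run.
            for step in range(1, n_threads):
                candidate = (current + step) % n_threads
                if ready[candidate][c] > 0:
                    current, warmup = candidate, startup
                    break
        if warmup > 0:
            warmup -= 1
            used.append(0)  # start-up period of the new thread: idle cycle
        else:
            used.append(min(ready[current][c], issue_width))
    return used


# Hypothetical workload: thread 0 has a long miss, thread 1 has little ILP.
ready = [
    [2, 2, 0, 0, 0, 0, 2, 2],
    [1, 1, 1, 1, 1, 1, 0, 0],
]
used = coarse_grained(ready, issue_width=4, startup=1)
print(used, "fully idle cycles:", used.count(0))
```

With these assumed inputs, the four idle cycles that thread 0 would see on its own shrink to two, and both remaining idle cycles are start-up cycles after a switch.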

In the fine-grained case, the interleaving of threads can eliminate fully empty slots. In addition, because the issuing thread is changed on every clock cycle, longer latency operations can be hidden. Because instruction issue and execution are connected, a thread can only issue as many instructions as are ready. With a narrow issue width, this is not a problem (a cycle is either occupied or not), which is why fine-grained multithreading works perfectly for a single-issue processor, and SMT would make no sense. Indeed, in the Sun T2, there are two issues per clock, but they are from different threads. This eliminates the need to implement the complex dynamic scheduling approach and relies instead on hiding latency with more threads.
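A minimal Python sketch of fine-grained interleaving, under the same kind of simplifying assumptions, may help make the slot behavior concrete. The function name fine_grained and the readiness table are hypothetical, readiness per cycle is fixed in advance, and stalled threads are simply skipped in round-robin order.

```python
def fine_grained(ready, issue_width):
    """Return (used, owner): slots used and issuing thread for each cycle.

    ready[t][c] is the assumed number of instructions thread t could
    issue in cycle c (0 = stalled). One thread issues per cycle, chosen
    round-robin starting after the thread that issued most recently.
    """
    n_threads, n_cycles = len(ready), len(ready[0])
    used, owner = [], []
    last = -1
    for c in range(n_cycles):
        chosen = None
        for step in range(1, n_threads + 1):
            candidate = (last + step) % n_threads
            if ready[candidate][c] > 0:
                chosen = candidate
                break
        if chosen is None:
            used.append(0)  # every thread is stalled in this cycle
            owner.append(None)
        else:
            # Only the chosen thread issues, and only what it has ready.
            used.append(min(ready[chosen][c], issue_width))
            owner.append(chosen)
            last = chosen
    return used, owner


# Hypothetical workload: thread 0 has a long miss, thread 1 has little ILP.
ready = [
    [2, 2, 0, 0, 0, 0, 2, 2],
    [1, 1, 1, 1, 1, 1, 0, 0],
]
used, owner = fine_grained(ready, issue_width=4)
print(used, owner)
```

In this toy run no cycle is fully empty, yet every cycle still leaves slots unused, because the single issuing thread can only supply as many instructions as it has ready.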

If one implements fine-grained threading on top of a multiple-issue dynamically scheduled processor, the result is SMT. In all existing SMT implementations, all issues come from one thread, although instructions from different threads can initiate execution in the same cycle,
