8.1.2 On-Chip Multithreading
All modern, pipelined CPUs have an inherent problem: when a memory reference
misses the level 1 and level 2 caches, there is a long wait until the requested
word (and its associated cache line) is loaded into the cache, so the pipeline stalls.
One approach to dealing with this situation, called on-chip multithreading, allows
the CPU to manage multiple threads of control at the same time in an attempt to
mask these stalls. In short, if thread 1 is blocked, the CPU still has a chance of
running thread 2 in order to keep the hardware fully occupied.
Although the basic idea is fairly simple, multiple variants exist, which we will
now examine. The first approach, called fine-grained multithreading, is illustrated
in Fig. 8-7 for a CPU with the ability to issue one instruction per clock cycle. In
Fig. 8-7(a)-(c), we see three threads, A, B, and C, for 12 machine cycles. During
the first cycle, thread A executes instruction A1. This instruction completes in one
cycle, so in the second cycle instruction A2 is started. Unfortunately, this instruction
misses on the level 1 cache, so two cycles are wasted while the word needed is
fetched from the level 2 cache. The thread continues in cycle 5. Similarly, threads
B and C stall occasionally, as illustrated in the figure. In this model, if
an instruction stalls, subsequent instructions cannot be issued. Of course, with a
more sophisticated scoreboard, sometimes new instructions can still be issued, but
we will ignore that possibility in this discussion.
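The timing of a single thread under this model can be sketched in a few lines of Python. This is only an illustration: the function name issue_cycles and the list-of-stalls encoding are assumptions made here, not something taken from the figure.

```python
def issue_cycles(stalls):
    """Return the cycle in which each instruction of one thread is issued.

    stalls[i] is the number of cycles wasted after instruction i+1 is issued
    (0 if it completes in one cycle). A stalled instruction blocks every
    later instruction of the same thread, as the model in the text assumes.
    """
    cycles = []
    cycle = 1
    for stall in stalls:
        cycles.append(cycle)
        cycle += 1 + stall  # one cycle to issue, plus any wasted cycles
    return cycles


# Thread A as described in the text: A1 completes in one cycle, A2 misses the
# level 1 cache and wastes two cycles, so A3 is issued in cycle 5.
print(issue_cycles([0, 2, 0]))  # [1, 2, 5]
```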
[Figure 8-7: five timelines, panels (a) through (e), showing which instruction is issued in each cycle; horizontal axis: Cycle.]
Figure 8-7. (a)-(c) Three threads. The empty boxes indicate that the thread has
stalled waiting for memory. (d) Fine-grained multithreading. (e) Coarse-grained
multithreading.
Fine-grained multithreading masks the stalls by running the threads round
robin, with a different thread in consecutive cycles, as shown in Fig. 8-7(d). By the
time the fourth cycle comes up, the memory operation initiated in A1 has completed,
so instruction A2 can be run, even if it needs the result of A1. In this case
the maximum stall is two cycles, so with three threads the stalled operation always
completes in time. If a memory stall took four cycles, we would need five threads
to ensure continuous operation, and so on.
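The round-robin behavior can be checked with a small simulation. The sketch below is a simplification under stated assumptions: every instruction is taken to stall for the same worst-case number of cycles, the rotation among threads is strictly fixed, and the function name fine_grained and its parameters are hypothetical names chosen for this example.

```python
def fine_grained(num_threads, stall, instructions_per_thread):
    """Simulate strict round-robin issue, one slot per cycle.

    Assumes (worst case) that every instruction stalls its thread for
    `stall` cycles after it is issued. Returns the schedule as a list with
    one entry per cycle: a label such as "A2", or None when the thread
    whose turn it is cannot issue.
    """
    names = [chr(ord("A") + i) for i in range(num_threads)]
    issued = [0] * num_threads    # instructions issued so far, per thread
    ready_at = [0] * num_threads  # first cycle each thread may issue again
    schedule = []
    cycle = 0
    while min(issued) < instructions_per_thread:
        t = cycle % num_threads   # fixed rotation: A, B, C, A, B, C, ...
        if issued[t] < instructions_per_thread and ready_at[t] <= cycle:
            issued[t] += 1
            schedule.append(f"{names[t]}{issued[t]}")
            ready_at[t] = cycle + 1 + stall
        else:
            schedule.append(None)  # wasted issue slot
        cycle += 1
    return schedule


# Three threads hide a two-cycle stall completely, as in Fig. 8-7(d).
print(fine_grained(3, 2, 4))
# Two threads do not: idle slots (None) reappear in the schedule.
print(fine_grained(2, 2, 2))
```

With three threads and a two-cycle stall, the schedule contains no idle slots; with only two threads, each thread's turn comes around before its stall has finished, and wasted slots show up again.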
Since different threads have nothing to do with one another, each one needs its
own set of registers. When an instruction is issued, a pointer to its register set has
to be included along with the instruction so that if a register is referenced, the
 
 
 