By similar reasoning, we cannot allow such instructions to cause the cache to stall on a miss, because again unnecessary stalls could overwhelm the benefits of speculation. Hence, these processors must be matched with nonblocking caches.

In reality, the penalty of an L2 miss is so large that compilers normally only speculate on L1 misses. Figure 2.5 on page 84 shows that for some well-behaved scientific programs the compiler can sustain multiple outstanding L2 misses to cut the L2 miss penalty effectively. Once again, for this to work, the memory system behind the cache must match the goals of the compiler in the number of simultaneous memory accesses.
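Whether the memory system can overlap misses depends on the program exposing independent addresses. As a minimal C sketch (the function names and the pointer-chasing contrast are illustrative, not from the text), compare a loop whose load addresses are all known in advance with one whose next address depends on the previous load:

#include <stddef.h>

/* Independent addresses: a nonblocking cache can have several of
 * these loads' misses outstanding at once. */
long sum_array(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];          /* address a+i is known before the load issues */
    return s;
}

struct node { long val; struct node *next; };

/* Dependent addresses: the next load cannot even be issued until the
 * current miss returns, so misses serialize regardless of the cache. */
long sum_list(const struct node *p) {
    long s = 0;
    while (p) {
        s += p->val;
        p = p->next;        /* next address comes from this load */
    }
    return s;
}

In the first loop a compiler or an out-of-order processor can keep multiple misses in flight; in the second, no amount of miss overlap in the cache helps.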
3.12 Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor Throughput
The topic we cover in this section, multithreading, is truly a cross-cutting topic, since it has relevance to pipelining and superscalars, to graphics processing units (Chapter 4), and to multiprocessors (Chapter 5). We introduce the topic here and explore the use of multithreading to increase uniprocessor throughput by using multiple threads to hide pipeline and memory latencies. In the next chapter, we will see how multithreading provides the same advantages in GPUs, and finally, Chapter 5 will explore the combination of multithreading and multiprocessing. These topics are closely interwoven, since multithreading is a primary technique for exposing more parallelism to the hardware. In a strict sense, multithreading uses thread-level parallelism, and thus is properly the subject of Chapter 5, but its role both in improving pipeline utilization and in GPUs motivates us to introduce the concept here.
Although increasing performance by using ILP has the great advantage that it is reasonably transparent to the programmer, as we have seen, ILP can be quite limited or difficult to exploit in some applications. In particular, with reasonable instruction issue rates, cache misses that go to memory or off-chip caches are unlikely to be hidden by available ILP; a 4-issue processor waiting 200 cycles for a memory access would need roughly 800 independent instructions to hide the stall, far more than most programs can supply. Of course, when the processor is stalled waiting on a cache miss, the utilization of the functional units drops dramatically.
Since attempts to cover long memory stalls with more ILP have limited effectiveness, it is natural to ask whether other forms of parallelism in an application could be used to hide memory delays. For example, an online transaction-processing system has natural parallelism among the multiple queries and updates that are presented by requests. Of course, many scientific applications contain natural parallelism, since they often model the three-dimensional, parallel structure of nature, and that structure can be exploited by using separate threads. Even desktop applications that use modern Windows-based operating systems often have multiple active applications running, providing a source of parallelism.
Multithreading allows multiple threads to share the functional units of a single processor in an overlapping fashion. In contrast, a more general method to exploit thread-level parallelism (TLP) is with a multiprocessor that has multiple independent threads operating at once and in parallel. Multithreading, however, does not duplicate the entire processor as a multiprocessor does. Instead, multithreading shares most of the processor core among a set of threads, duplicating only private state, such as the registers and program counter. As we will see in Chapter 5, many recent processors incorporate multiple processor cores on a single chip and also provide multithreading within each core.
Duplicating the per-thread state of a processor core means creating a separate register file, a separate PC, and a separate page table for each thread. The memory itself can be shared through the virtual memory mechanisms, which already support multiprogramming. In addition, the hardware must support the ability to change to a different thread relatively quickly; in particular, a thread switch should be much more efficient than a process switch, which typically requires hundreds to thousands of processor cycles.
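To make the split between duplicated and shared state concrete, here is a minimal C sketch of a multithreaded core's bookkeeping; the structure names, field names, and sizes are assumptions for illustration, not a real design:

#include <stdint.h>

#define NUM_THREADS 4            /* assumed number of hardware threads   */
#define NUM_REGS    32           /* assumed architectural register count */

/* Private state: one copy per hardware thread. */
struct thread_state {
    uint64_t regs[NUM_REGS];     /* architectural register file          */
    uint64_t pc;                 /* program counter                      */
    uint64_t page_table_base;    /* root of this thread's page table     */
};

/* Shared state: one copy serves all threads. Caches, TLBs, and the
 * functional units conceptually live here and are not duplicated. */
struct core {
    struct thread_state threads[NUM_THREADS];
    int active;                  /* which thread may issue this cycle    */
};

/* A hardware thread switch merely changes which thread_state feeds the
 * pipeline, which is why it can be far cheaper than an OS process
 * switch that saves and restores state in software. */
static void switch_thread(struct core *c, int next) {
    c->active = next % NUM_THREADS;
}

The key design point the sketch reflects is that only the small per-thread block is replicated; everything expensive about the core is time-shared among the threads.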