T1 Multithreading Unicore Performance
The T1 makes TLP its focus, both through the multithreading on an individual core and
through the use of many simple cores on a single die. In this section, we will look at the
effectiveness of the T1 in increasing the performance of a single core through fine-grained
multithreading. In Chapter 5, we will return to examine the effectiveness of combining
multithreading with multiple cores.
To examine the performance of the T1, we use three server-oriented benchmarks: TPC-C,
SPECJBB (the SPEC Java Business Benchmark), and SPECWeb99. Since multiple threads increase
the memory demands from a single processor, they could overload the memory system,
leading to reductions in the potential gain from multithreading. Figure 3.30 shows the relative
increase in the miss rate and the observed miss latency when executing with one thread per
core versus executing four threads per core for TPC-C. Both the miss rates and the miss
latencies increase, due to increased contention in the memory system. The relatively small
increase in miss latency indicates that the memory system still has unused capacity.
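To see how the two increases combine, the following is a minimal sketch, assuming that memory time per access can be approximated as miss rate times miss latency with no overlap between misses. The function name and the input ratios are hypothetical illustrations; they are not values read from Figure 3.30.

```python
def relative_stall_increase(miss_rate_ratio: float, miss_latency_ratio: float) -> float:
    """Return the four-thread / one-thread ratio of miss time per access.

    Assumption: miss time per access ~= miss_rate * miss_latency, so the
    ratio of the products is the product of the two ratios.
    """
    if miss_rate_ratio <= 0 or miss_latency_ratio <= 0:
        raise ValueError("ratios must be positive")
    return miss_rate_ratio * miss_latency_ratio


if __name__ == "__main__":
    # Hypothetical ratios chosen only for illustration.
    combined = relative_stall_increase(miss_rate_ratio=1.5, miss_latency_ratio=1.2)
    print(f"combined miss-time ratio: {combined:.2f}")
```

This ratio describes time spent in the memory system, not time the core is stalled: with four threads, other threads may execute while a miss is outstanding.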
FIGURE 3.30 The relative change in the miss rates and miss latencies when executing
with one thread per core versus four threads per core on the TPC-C benchmark. The
latencies are the actual time to return the requested data after a miss. In the four-thread case,
the execution of other threads could potentially hide much of this latency.
By looking at the behavior of an average thread, we can understand the interaction among
the threads and their ability to keep a core busy. Figure 3.31 shows the percentage of cycles for
which a thread is executing, ready but not executing, and not ready. Remember that not ready
does not imply that the core with that thread is stalled; it is only when all four threads are not
ready that the core will stall.
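The point that a core stalls only when every thread is not ready can be illustrated with a rough estimate. The sketch below assumes, purely for illustration, that each thread is not ready for the same fraction of cycles and that the threads are independent of one another; real threads share caches and memory bandwidth, so the independence assumption is optimistic. The function name and the input value are hypothetical and are not taken from Figure 3.31.

```python
def core_stall_fraction(not_ready_fraction: float, threads: int = 4) -> float:
    """Estimate the fraction of cycles in which all threads are not ready.

    Assumption: each thread is independently not ready with the same
    probability, so the core stalls with probability p ** threads.
    """
    if not 0.0 <= not_ready_fraction <= 1.0:
        raise ValueError("not_ready_fraction must be between 0 and 1")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return not_ready_fraction ** threads


if __name__ == "__main__":
    p = 0.5  # hypothetical per-thread not-ready fraction
    print(f"one thread:   core stalled {core_stall_fraction(p, threads=1):.4f} of cycles")
    print(f"four threads: core stalled {core_stall_fraction(p, threads=4):.4f} of cycles")
```

Under these assumptions, a thread that is not ready half the time leaves a single-threaded core idle half the time, whereas a four-thread core is idle in only a small fraction of cycles.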