Thread-Level Parallelism - Computer Architecture: A Quantitative Approach

FIGURE 5.11 The execution time breakdown for the three programs (OLTP, DSS, and AltaVista) in the commercial workload. The DSS numbers are the average across six different queries. The CPI varies widely from a low of 1.3 for AltaVista, to 1.61 for the DSS queries, to 7.0 for OLTP. (Individually, the DSS queries show a CPI range of 1.3 to 1.9.) “Other stalls” includes resource stalls (implemented with replay traps on the 21164), branch mispredict, memory barrier, and TLB misses. For these benchmarks, resource-based pipeline stalls are the dominant factor. These data combine the behavior of user and kernel accesses. Only OLTP has a significant fraction of kernel accesses, and the kernel accesses tend to be better behaved than the user accesses! All the measurements shown in this section were collected

Since the OLTP workload demands the most from the memory system with large numbers of expensive L3 misses, we focus on examining the impact of L3 cache size, processor count, and block size on the OLTP benchmark. Figure 5.12 shows the effect of increasing the cache size, using two-way set associative caches, which reduces the large number of conflict misses. The execution time improves as the L3 cache grows because of the reduction in L3 misses. Surprisingly, almost all of the gain occurs in going from 1 to 2 MB, with little additional gain beyond that, despite the fact that cache misses are still a cause of significant performance loss with 2 MB and 4 MB caches. The question is, Why?
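
As a rough way to see how L3 misses feed into execution time, the following minimal sketch uses a simple additive CPI model: a base CPI plus L3 misses per instruction times the miss penalty. The function names, the base CPI, the miss penalty, and the miss counts per cache size are all hypothetical placeholders chosen for illustration; they are not the measurements behind Figure 5.11 or Figure 5.12. The model also assumes that miss stalls do not overlap with other work, which a real processor may partly violate.

```python
# Sketch only: every number below is a made-up placeholder, not data from
# Figure 5.11 or Figure 5.12.

def effective_cpi(base_cpi, l3_misses_per_instr, l3_miss_penalty):
    """Base CPI plus the stall cycles per instruction caused by L3 misses."""
    return base_cpi + l3_misses_per_instr * l3_miss_penalty


def normalized_times(base_cpi, l3_miss_penalty, mpki_by_size_mb):
    """Execution time per cache size, relative to the smallest cache.

    With instruction count and clock rate held fixed, execution time is
    proportional to CPI, so a ratio of CPIs is a ratio of execution times.
    """
    smallest = min(mpki_by_size_mb)
    ref = effective_cpi(base_cpi, mpki_by_size_mb[smallest] / 1000, l3_miss_penalty)
    return {
        size: effective_cpi(base_cpi, mpki / 1000, l3_miss_penalty) / ref
        for size, mpki in sorted(mpki_by_size_mb.items())
    }


if __name__ == "__main__":
    # Hypothetical L3 misses per 1000 instructions, keyed by cache size in MB.
    hypothetical_mpki = {1: 8.0, 2: 5.0, 4: 4.5, 8: 4.2}
    for size_mb, t in normalized_times(2.0, 100, hypothetical_mpki).items():
        print(f"{size_mb} MB L3: normalized execution time {t:.2f}")
```

In this model, once the miss count stops falling quickly with cache size, the normalized execution time flattens as well, even though the remaining misses still account for a sizable share of the CPI.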
