FIGURE 5.11 The execution time breakdown for the three programs (OLTP, DSS, and AltaVista) in the commercial workload. The DSS numbers are the average across six different queries. The CPI varies widely, from a low of 1.3 for AltaVista, to 1.61 for the DSS queries, to 7.0 for OLTP. (Individually, the DSS queries show a CPI range of 1.3 to 1.9.) “Other stalls” includes resource stalls (implemented with replay traps on the 21164), branch mispredictions, memory barriers, and TLB misses. For these benchmarks, resource-based pipeline stalls are the dominant factor. These data combine the behavior of user and kernel accesses. Only OLTP has a significant fraction of kernel accesses, and the kernel accesses tend to be better behaved than the user accesses! All the measurements shown in this section were collected by Barroso, Gharachorloo, and Bugnion [1998].
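A breakdown like the one in Figure 5.11 is a CPI stack: total CPI is the sum of per-instruction cycle contributions, each component's share of execution time is its contribution divided by total CPI, and a memory-level stall contribution is misses per instruction times miss penalty. The sketch below illustrates that arithmetic only. The function names, the three component labels, and every number in the example (miss rate, miss penalty, instruction count, 2 GHz clock) are hypothetical and are not taken from the measurements.

```python
def total_cpi(components: dict[str, float]) -> float:
    """Total CPI is the sum of the per-instruction cycle contributions."""
    return sum(components.values())


def cpi_fractions(components: dict[str, float]) -> dict[str, float]:
    """Fraction of execution time attributable to each component."""
    total = total_cpi(components)
    return {name: cycles / total for name, cycles in components.items()}


def stall_cpi(misses_per_instruction: float, miss_penalty_cycles: float) -> float:
    """Stall cycles per instruction contributed by one level of the memory hierarchy."""
    return misses_per_instruction * miss_penalty_cycles


def execution_time_seconds(instruction_count: int, cpi: float, clock_hz: float) -> float:
    """CPU performance equation: time = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi / clock_hz


# Hypothetical example: every number below is invented for illustration.
components = {
    "instruction execution": 1.0,
    "memory hierarchy stalls": stall_cpi(0.025, 100),  # 2.5 cycles per instruction
    "other stalls": 0.5,
}
cpi = total_cpi(components)
print(cpi)  # 4.0
print(cpi_fractions(components))
# {'instruction execution': 0.25, 'memory hierarchy stalls': 0.625, 'other stalls': 0.125}
print(execution_time_seconds(1_000_000_000, cpi, 2e9))  # 2.0 (seconds, hypothetical 2 GHz clock)
```

Because each share is divided by the same total, a workload with a high CPI such as OLTP can have its time dominated by stall components even when its instruction execution component is modest.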
Since the OLTP workload demands the most from the memory system, with large numbers of expensive L3 misses, we focus on examining the impact of L3 cache size, processor count, and block size on the OLTP benchmark. Figure 5.12 shows the effect of increasing the cache size, using two-way set-associative caches, which reduce the large number of conflict misses. Execution time improves as the L3 cache grows, because L3 misses decrease. Surprisingly, almost all of the gain occurs in going from 1 to 2 MB, with little additional gain beyond that, despite the fact that cache misses are still a cause of significant performance loss with 2 MB and 4 MB caches. The question is, why?
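To make the cache-size experiment concrete, here is a minimal sketch of a two-way set-associative cache with LRU replacement that counts misses over an address trace at several capacities. The class `SetAssociativeCache`, the helper `count_misses`, the 64-byte block size, the LRU policy, and the synthetic trace are all assumptions for illustration; this is not the simulator or trace behind the measurements, and it models a single cache with no multiprocessor sharing.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Hypothetical set-associative cache model with LRU replacement."""

    def __init__(self, size_bytes: int, block_bytes: int = 64, ways: int = 2):
        num_blocks = size_bytes // block_bytes
        if num_blocks == 0 or num_blocks % ways != 0:
            raise ValueError("cache size must hold a whole, nonzero number of sets")
        self.block_bytes = block_bytes
        self.ways = ways
        self.num_sets = num_blocks // ways
        # One OrderedDict per set; key order tracks recency (last = most recent).
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.misses = 0

    def access(self, address: int) -> bool:
        """Look up one address; return True on a hit, False on a miss."""
        block = address // self.block_bytes
        index = block % self.num_sets
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)  # mark as most recently used
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            lines.popitem(last=False)  # evict the least recently used block
        lines[tag] = None
        return False


def count_misses(trace, size_bytes, block_bytes=64, ways=2):
    """Run a trace through a fresh cache and return its miss count."""
    cache = SetAssociativeCache(size_bytes, block_bytes, ways)
    for address in trace:
        cache.access(address)
    return cache.misses


# Hypothetical trace: 24 distinct 64-byte blocks, swept cyclically three times.
trace = [block * 64 for block in range(24)] * 3
for size in (1024, 2048, 4096):
    print(size, count_misses(trace, size))
# 1024 72
# 2048 24
# 4096 24
```

With this toy trace, the 1 KB cache thrashes because three blocks compete for each two-way set, while at 2 KB and above only the 24 compulsory misses remain, so doubling capacity again gains nothing. A real OLTP reference stream is far more complex than one cyclic working set, so the sketch shows only how such a size sweep is measured.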
 