Thread-Level Parallelism - Computer Architecture: A Quantitative Approach

cache hierarchy, is very similar to the multicore Intel i7 and other processors, as shown in Figure 5.9. In particular, the Alpha caches are somewhat smaller, but the miss times are also lower than on an i7. Thus, the behavior of the Alpha system should provide interesting insights into the behavior of modern multicore designs.

FIGURE 5.9 The characteristics of the cache hierarchy of the Alpha 21164 used in this study and the Intel i7. Although the sizes are larger and the associativity is higher on the i7, the miss penalties are also higher, so the behavior may differ only slightly. For example, from Appendix B, we can estimate the miss rate as 4.9% for the smaller Alpha L1 cache and 3% for the larger i7 L1 cache, so the average L1 miss penalty per reference is 0.34 cycles for the Alpha and 0.30 cycles for the i7. Both systems have a high penalty (125 cycles or more) for a transfer required from a private cache. The i7 also shares its L3 among all the cores.
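The per-reference figures in the caption follow from multiplying each L1 miss rate by the corresponding L1 miss penalty. The short sketch below reproduces that arithmetic. The miss rates are the ones quoted in the caption; the per-miss penalties (7 cycles for the Alpha, 10 cycles for the i7) are assumptions that do not appear in this excerpt and were chosen because they are consistent with the quoted results. The function and dictionary names are hypothetical.

```python
# Average L1 miss penalty per memory reference = miss rate x miss penalty.
# Miss rates come from the figure caption. The per-miss penalties (in cycles)
# are ASSUMED values, not stated in this excerpt.


def miss_penalty_per_reference(miss_rate: float, miss_penalty_cycles: float) -> float:
    """Return the average number of stall cycles each reference pays for L1 misses."""
    return miss_rate * miss_penalty_cycles


caches = {
    "Alpha 21164": {"l1_miss_rate": 0.049, "l1_miss_penalty_cycles": 7},   # penalty assumed
    "Intel i7":    {"l1_miss_rate": 0.030, "l1_miss_penalty_cycles": 10},  # penalty assumed
}

results = {
    name: miss_penalty_per_reference(c["l1_miss_rate"], c["l1_miss_penalty_cycles"])
    for name, c in caches.items()
}

for name, cycles in results.items():
    print(f"{name}: {cycles:.2f} cycles per reference")
```

Under these assumptions the smaller cache's higher miss rate is largely offset by its lower miss penalty, which is the point the caption makes when it says the behavior may differ only slightly.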

The workload used for this study consists of three applications:

1. An online transaction-processing (OLTP) workload modeled after TPC-B (which has memory behavior similar to its newer cousin TPC-C, described in Chapter 1) and using Oracle 7.3.2 as the underlying database. The workload consists of a set of client processes that generate requests and a set of servers that handle them. The server processes consume 85% of the user time, with the remainder going to the clients. Although the I/O latency is hidden by careful tuning and enough requests to keep the processor busy, the server processes typically block for I/O after about 25,000 instructions.

2. A decision support system (DSS) workload based on TPC-D, the older cousin of the heavily used TPC-E. Like the OLTP workload, it uses Oracle 7.3.2 as the underlying database. The workload includes only 6 of the 17 read queries in TPC-D, although the 6 queries examined in the benchmark span the range of activities in the entire benchmark. To hide the I/O latency, parallelism is exploited both within queries, where parallelism is detected during a query formulation process, and across queries. Blocking calls are much less frequent than in the OLTP benchmark; the 6 queries average about 1.5 million instructions before blocking.
