Thread-Level Parallelism - Computer Architecture: A Quantitative Approach


Assume that L2 has a block size four times that of L1. Show how a miss for an address that causes a replacement in L1 and L2 can lead to a violation of the inclusion property.

Answer

Assume that L1 and L2 are direct mapped and that the block size of L1 is b bytes and the block size of L2 is 4b bytes. Suppose L1 contains two blocks with starting addresses x and x + b and that x mod 4b = 0, meaning that x is also the starting address of a block in L2; then that single block in L2 contains the L1 blocks x, x + b, x + 2b, and x + 3b. Suppose the processor generates a reference to block y that maps to the block containing x in both caches and hence misses. Since L2 missed, it fetches 4b bytes and replaces the block containing x, x + b, x + 2b, and x + 3b, while L1 takes b bytes and replaces the block containing x. Since L1 still contains x + b, but L2 does not, the inclusion property no longer holds.
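The scenario in the answer can be replayed with a small tag-only model. This is a minimal sketch, not a description of any real cache: the class `DirectMappedCache`, the helper `inclusion_holds`, the cache sizes (four blocks each), and the concrete values b = 16, x = 0, and y = 256 are all assumptions chosen so that y maps to the same line as x in both caches.

```python
class DirectMappedCache:
    """Tag-only model of a direct-mapped cache (hypothetical, no data or write policy)."""

    def __init__(self, block_size, num_blocks):
        self.block_size = block_size
        self.num_blocks = num_blocks
        # Each line stores the starting address of the block it holds, or None.
        self.lines = [None] * num_blocks

    def _block_start(self, addr):
        return addr - addr % self.block_size

    def _index(self, addr):
        return (addr // self.block_size) % self.num_blocks

    def contains(self, addr):
        return self.lines[self._index(addr)] == self._block_start(addr)

    def fill(self, addr):
        # Install the block holding addr, overwriting whatever the line held.
        self.lines[self._index(addr)] = self._block_start(addr)


def inclusion_holds(l1, l2):
    """True if every block resident in L1 is also covered by a block in L2."""
    return all(l2.contains(start) for start in l1.lines if start is not None)


b = 16                                    # assumed L1 block size in bytes
l1 = DirectMappedCache(block_size=b, num_blocks=4)
l2 = DirectMappedCache(block_size=4 * b, num_blocks=4)
x = 0                                     # x mod 4b == 0
y = 256                                   # maps to line 0 in both caches

# Initial state: L2 holds the 4b-byte block at x; L1 holds x and x + b.
l2.fill(x)
l1.fill(x)
l1.fill(x + b)
before = inclusion_holds(l1, l2)

# The miss on y replaces the block containing x in both caches.
l2.fill(y)
l1.fill(y)
after = inclusion_holds(l1, l2)

print(before, after)
```

After the miss, L1 still holds the block at x + b, while the only L2 block that could cover it has been replaced, so the check fails.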

To maintain inclusion with multiple block sizes, we must probe the higher levels of the hierarchy when a replacement is done at the lower level to ensure that any words replaced in the lower level are invalidated in the higher-level caches; different levels of associativity create the same sort of problems. In 2011, designers still appear to be split on the enforcement of inclusion. Baer and Wang [1988] described the advantages and challenges of inclusion in detail. The Intel i7 uses inclusion for L3, meaning that L3 always includes the contents of all of L2 and L1. This allows Intel to implement a straightforward directory scheme at L3 and to limit the interference from snooping on L1 and L2 to those circumstances where the directory indicates that L1 or L2 has a cached copy. The AMD Opteron, in contrast, makes L2 inclusive of L1 but has no such restriction for L3. The Opteron uses a snooping protocol but needs to snoop only at L2 unless there is a hit, in which case a snoop is sent to L1.
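The probe-and-invalidate step for multiple block sizes can be sketched as follows. This is an illustrative model only: the function name `fill_inclusive`, the list-based caches, and the sizes (b = 16, four lines per cache) are assumptions, and real designs differ in how the probe is issued and how dirty data is handled.

```python
B = 16            # assumed L1 block size in bytes
L2_B = 4 * B      # L2 block size is four times that of L1
L1_LINES = 4      # assumed number of lines in the direct-mapped L1
L2_LINES = 4      # assumed number of lines in the direct-mapped L2


def fill_inclusive(l1, l2, addr):
    """Handle a miss on addr in both caches while preserving inclusion.

    l1 and l2 are lists of block starting addresses (None = invalid line).
    """
    l2_start = addr - addr % L2_B
    l2_idx = (addr // L2_B) % L2_LINES
    victim = l2[l2_idx]
    if victim is not None and victim != l2_start:
        # Probe L1 for every b-byte piece of the evicted 4b-byte L2 block
        # and invalidate any piece that is still resident there.
        for sub in range(victim, victim + L2_B, B):
            l1_idx = (sub // B) % L1_LINES
            if l1[l1_idx] == sub:
                l1[l1_idx] = None
    l2[l2_idx] = l2_start
    # Finally install the requested b-byte block in L1.
    l1[(addr // B) % L1_LINES] = addr - addr % B


# Same starting state as the example: L1 holds x = 0 and x + b = 16.
l1_state = [0, 16, None, None]
l2_state = [0, None, None, None]
fill_inclusive(l1_state, l2_state, 256)
print(l1_state, l2_state)
```

In this run the eviction of the L2 block at address 0 also invalidates the L1 block at x + b, so every block left in L1 is again covered by L2. The cost is the extra probe of L1 on each L2 replacement.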

Performance Gains from Using Multiprocessing and Multithreading

In this section, we look at two different studies of the effectiveness of using multithreading on a multicore processor; we will return to this topic in the next section, when we examine the performance of the Intel i7. Our two studies are based on the Sun T1, which we introduced in Chapter 3, and the IBM Power5 processor.

We look at the performance of the T1 multicore using the same three server-oriented benchmarks that we examined in Chapter 3: TPC-C, SPECJBB (the SPEC Java Business Benchmark), and SPECWeb99. The SPECWeb99 benchmark is only run on a four-core version of T1 because it cannot scale to use the full 32 threads of an eight-core processor; the other two benchmarks are run with eight cores and four threads each for a total of 32 threads. Figure 5.25 shows the per-thread and per-core CPIs and the effective CPI and instructions per clock (IPC) for the eight-core T1.
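The four quantities named for Figure 5.25 are related by simple arithmetic if we assume that the four threads share a core evenly and that all eight cores are equally loaded. The sketch below works under that assumption; the function name `t1_throughput` is hypothetical, and the per-thread CPI of 8.0 is a made-up input, not a value taken from the figure.

```python
def t1_throughput(per_thread_cpi, threads_per_core=4, cores=8):
    """Derive per-core CPI, effective CPI, and IPC from a per-thread CPI.

    Assumes evenly loaded threads and cores (an idealization).
    """
    per_core_cpi = per_thread_cpi / threads_per_core   # threads interleave on one core
    effective_cpi = per_core_cpi / cores               # cores run in parallel
    effective_ipc = 1.0 / effective_cpi
    return per_core_cpi, effective_cpi, effective_ipc


# Hypothetical input for illustration only.
per_core, effective, ipc = t1_throughput(8.0)
print(per_core, effective, ipc)
```

With the hypothetical per-thread CPI of 8.0, the per-core CPI is 2.0, the effective CPI across eight cores is 0.25, and the effective IPC is 4.0.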
