Assume that L2 has a block size four times that of L1. Show how a miss for an
address that causes a replacement in L1 and L2 can lead to violation of the
inclusion property.
Answer
Assume that L1 and L2 are direct mapped and that the block size of L1 is b bytes
and the block size of L2 is 4b bytes. Suppose L1 contains two blocks with
starting addresses x and x + b and that x mod 4b = 0, meaning that x is also the
starting address of a block in L2; then that single block in L2 contains the L1
blocks x, x + b, x + 2b, and x + 3b. Suppose the processor generates a reference
to block y that maps to the block containing x in both caches and hence misses.
Since L2 missed, it fetches 4b bytes and replaces the block containing x, x + b,
x + 2b, and x + 3b, while L1 takes b bytes and replaces the block containing x.
Since L1 still contains x + b, but L2 does not, the inclusion property no longer
holds.
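The scenario in the answer can be traced with a small simulation. The following is a minimal sketch, not a model of any real cache: the class `DirectMappedCache`, the helper `inclusion_holds`, the block size b = 16 bytes, the four-block capacities, and the addresses x = 0 and y = 256 are all assumptions chosen so that y maps to the same index as x in both caches.

```python
# Minimal sketch of the example: two direct-mapped caches whose block sizes
# differ by a factor of four. All sizes and addresses are illustrative
# assumptions, not values from the text.

class DirectMappedCache:
    def __init__(self, block_size, num_blocks):
        self.block_size = block_size
        self.num_blocks = num_blocks
        self.lines = {}  # set index -> starting address of the resident block

    def block_start(self, addr):
        return addr - (addr % self.block_size)

    def index(self, addr):
        return (addr // self.block_size) % self.num_blocks

    def contains(self, addr):
        return self.lines.get(self.index(addr)) == self.block_start(addr)

    def fill(self, addr):
        """Bring in the block holding addr, replacing whatever shares its index."""
        self.lines[self.index(addr)] = self.block_start(addr)


def inclusion_holds(l1, l2):
    """Every block resident in L1 must lie inside a block resident in L2."""
    return all(l2.contains(start) for start in l1.lines.values())


B = 16                                                  # assumed L1 block size b
l1 = DirectMappedCache(block_size=B, num_blocks=4)
l2 = DirectMappedCache(block_size=4 * B, num_blocks=4)

x = 0                                                   # x mod 4b == 0
for addr in (x, x + B):                                 # L1 holds x and x + b
    l2.fill(addr)
    l1.fill(addr)
before = inclusion_holds(l1, l2)

y = 256                                                 # same index as x in both caches
l2.fill(y)                                              # evicts the 4b-byte block at x
l1.fill(y)                                              # evicts only the b-byte block at x
after = inclusion_holds(l1, l2)

print(before, after)
```

After the reference to y, L1 still holds the block at x + b while L2 no longer holds the 4b-byte block that covers it, which is the violation the answer describes.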
To maintain inclusion with multiple block sizes, we must probe the higher levels of the
hierarchy (those closer to the processor) when a replacement is done at the lower level, to
ensure that any words replaced in the lower level are invalidated in the higher-level caches;
different levels of associativity create the same sort of problems. In 2011, designers still
appear to be split on the enforcement of inclusion. Baer and Wang [1988] described the
advantages and challenges of inclusion in detail. The Intel i7 uses inclusion for L3, meaning
that L3 always includes the contents of all of L2 and L1. This allows its designers to implement
a straightforward directory scheme at L3 and to limit the interference from snooping on L1 and
L2 to those circumstances where the directory indicates that L1 or L2 has a cached copy. The
AMD Opteron, in contrast, makes L2 inclusive of L1 but has no such restriction for L3. It uses
a snooping protocol but needs to snoop only at L2 unless there is a hit, in which case a snoop
is sent to L1.
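The probing step mentioned at the start of this paragraph, where an L2 replacement invalidates the matching blocks in L1, can be sketched as follows. This is an assumed, simplified design for two direct-mapped caches and does not describe the i7 or the Opteron. The function names (`access`, `inclusive`), the constants, and the addresses are hypothetical.

```python
# Sketch of inclusion enforcement by back-invalidation (an assumed design).
# Each cache is direct mapped and modeled as a dict that maps a set index to
# the starting address of the resident block.

L1_BLOCK, L1_SETS = 16, 4   # hypothetical sizes: b = 16 bytes
L2_BLOCK, L2_SETS = 64, 4   # L2 block is 4b bytes


def start(addr, block):
    return addr - addr % block


def index(addr, block, sets):
    return (addr // block) % sets


def access(l1, l2, addr, back_invalidate=True):
    """Handle a reference to addr, filling L2 and then L1."""
    i2 = index(addr, L2_BLOCK, L2_SETS)
    new_l2 = start(addr, L2_BLOCK)
    victim = l2.get(i2)
    if victim != new_l2:                      # L2 miss: replace the resident block
        if back_invalidate and victim is not None:
            # Probe L1 and drop every L1 block that lies inside the L2 victim.
            for i1, s in list(l1.items()):
                if victim <= s < victim + L2_BLOCK:
                    del l1[i1]
        l2[i2] = new_l2
    l1[index(addr, L1_BLOCK, L1_SETS)] = start(addr, L1_BLOCK)


def inclusive(l1, l2):
    """True if every L1 block is covered by a resident L2 block."""
    return all(
        l2.get(index(s, L2_BLOCK, L2_SETS)) == start(s, L2_BLOCK)
        for s in l1.values()
    )


def run(back_invalidate):
    l1, l2 = {}, {}
    for addr in (0, 16, 256):                 # x, x + b, then the conflicting y
        access(l1, l2, addr, back_invalidate)
    return inclusive(l1, l2), sorted(l1.values())


with_probe = run(back_invalidate=True)
without_probe = run(back_invalidate=False)
print(with_probe, without_probe)
```

With the probe enabled, the L1 block at x + b is invalidated when its enclosing L2 block is replaced, so inclusion is preserved at the cost of an extra L1 lookup on each L2 replacement.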
Performance Gains from Using Multiprocessing and Multithreading
In this section, we look at two different studies of the effectiveness of using multithreading on
a multicore processor; we will return to this topic in the next section, when we examine the
performance of the Intel i7. Our two studies are based on the Sun T1, which we introduced in
Chapter 3, and the IBM Power5 processor.
We look at the performance of the T1 multicore using the same three server-oriented
benchmarks—TPC-C, SPECJBB (the SPEC Java Business Benchmark), and SPECWeb99—that
we examined in Chapter 3. The SPECWeb99 benchmark is run only on a four-core version of
T1 because it cannot scale to use the full 32 threads of an eight-core processor; the other two
benchmarks are run with eight cores and four threads each for a total of 32 threads. Figure
5.25 shows the per-thread and per-core CPIs and the effective CPI and instructions per clock
(IPC) for the eight-core T1.
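The relationship among these four metrics can be illustrated with a short calculation. This is a sketch under stated assumptions: the threads of a core share one pipeline and progress at equal rates, so per-core CPI is per-thread CPI divided by the threads per core; all cores contribute equally, so effective CPI is per-core CPI divided by the core count; and IPC is the inverse of effective CPI. The function name `t1_cpi_summary` and the input value 8.0 are hypothetical and are not taken from Figure 5.25.

```python
# Sketch of how per-thread CPI, per-core CPI, effective CPI, and IPC relate,
# assuming equal progress across threads and cores (a simplification).

def t1_cpi_summary(per_thread_cpi, threads_per_core=4, cores=8):
    per_core_cpi = per_thread_cpi / threads_per_core   # threads share one pipeline
    effective_cpi = per_core_cpi / cores               # cores run in parallel
    return {
        "per_core_cpi": per_core_cpi,
        "effective_cpi": effective_cpi,
        "ipc": 1 / effective_cpi,                      # IPC is the inverse of CPI
    }


# Hypothetical per-thread CPI of 8.0, chosen only to keep the arithmetic exact.
summary = t1_cpi_summary(8.0)
print(summary)
```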