ization enforced by the bus also serializes their writes. One implication of this scheme is that a
write to a shared data item cannot actually complete until it obtains bus access. All coherence
schemes require some method of serializing accesses to the same cache block, either by serial-
izing access to the communication medium or another shared structure.
In addition to invalidating outstanding copies of a cache block that is being written into, we
also need to locate a data item when a cache miss occurs. In a write-through cache, it is easy to
find the most recent value of a data item, since all written data are always sent to the memory, from
which the most recent value of a data item can always be fetched. (Write buffers can lead to
some additional complexities and must effectively be treated as additional cache entries.)
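The write-through case can be sketched in a few lines. This is an illustrative model only (the function names are invented for this sketch): because every store is also propagated to memory, a read after a miss can always be satisfied from memory.

```python
# Minimal sketch of the write-through property described above:
# memory always holds the most recent value of every data item.

def write_through_store(cache, memory, addr, value):
    cache[addr] = value      # update the private cache ...
    memory[addr] = value     # ... and send the write through to memory

def read_after_miss(memory, addr):
    # On a cache miss, memory is guaranteed to be up to date,
    # so the missing block can simply be fetched from there.
    return memory[addr]
```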
For a write-back cache, the problem of finding the most recent data value is harder, since
the most recent value of a data item can be in a private cache rather than in the shared cache or
memory. Happily, write-back caches can use the same snooping scheme both for cache misses
and for writes: Each processor snoops every address placed on the shared bus. If a processor
finds that it has a dirty copy of the requested cache block, it provides that cache block in
response to the read request and causes the memory (or L3) access to be aborted. The additional
complexity comes from having to retrieve the cache block from another processor's private
cache (L1 or L2), which can often take longer than retrieving it from L3. Since write-back
caches generate lower requirements for memory bandwidth, they can support larger numbers
of faster processors. As a result, all multicore processors use write-back at the outermost levels
of the cache, and we will examine the implementation of coherence with write-back caches.
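The snoop-on-miss behavior described above can be sketched as follows. This is a simplified model, not any real controller's interface; `CacheLine` and `snoop_read_miss` are names invented for the sketch, and each private cache is modeled as a dictionary from addresses to lines.

```python
# Sketch of servicing a read miss under write-back snooping: every private
# cache checks its tags, and a cache holding a dirty copy supplies the block
# while the memory (or L3) access is aborted.

from dataclasses import dataclass

@dataclass
class CacheLine:
    valid: bool = False
    dirty: bool = False
    data: int = 0

def snoop_read_miss(addr, private_caches, memory):
    """Return the most recent value of `addr` for a missing requester."""
    for cache in private_caches:
        line = cache.get(addr)
        if line is not None and line.valid and line.dirty:
            # Dirty copy found: this cache responds instead of memory,
            # and the block is written back so memory is up to date again.
            memory[addr] = line.data
            line.dirty = False
            return line.data
    # No dirty copy anywhere: memory (or L3) already has the latest value.
    return memory[addr]
```

Retrieving the block from another core's L1 or L2, as in the loop above, is the step the text notes can take longer than an L3 access.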
The normal cache tags can be used to implement the process of snooping, and the valid bit
for each block makes invalidation easy to implement. Read misses, whether generated by an
invalidation or by some other event, are also straightforward since they simply rely on the
snooping capability. For writes we would like to know whether any other copies of the block
are cached because, if there are no other cached copies, then the write need not be placed on
the bus in a write-back cache. Not sending the write reduces both the time to write and the
required bandwidth.
To track whether or not a cache block is shared, we can add an extra state bit associated with
each cache block, just as we have a valid bit and a dirty bit. By adding a bit indicating whether
the block is shared, we can decide whether a write must generate an invalidate. When a write
to a block in the shared state occurs, the cache generates an invalidation on the bus and marks
the block as exclusive. No further invalidations will be sent by that core for that block. The core
with the sole copy of a cache block is normally called the owner of the cache block.
When an invalidation is sent, the state of the owner's cache block is changed from shared to
unshared (or exclusive). If another processor later requests this cache block, the state must be
made shared again. Since our snooping cache also sees any misses, it knows when the exclus-
ive cache block has been requested by another processor and the state should be made shared.
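The state transitions in the last two paragraphs can be summarized as a small per-block state machine. This is a minimal sketch under the simplifications in the text (a broadcast bus, one invalidate message); the class and method names are invented here and the bus is modeled as a list of messages.

```python
# Per-block state machine for the shared/exclusive scheme described above.

INVALID, SHARED, EXCLUSIVE = "invalid", "shared", "exclusive"

class Block:
    def __init__(self):
        self.state = INVALID

    def local_write(self, bus):
        """A write to a shared block puts one invalidate on the bus and
        makes this core the owner; later writes stay off the bus."""
        if self.state == SHARED:
            bus.append("invalidate")
        self.state = EXCLUSIVE

    def snoop_read(self):
        """Another core's read miss is seen on the bus: the owner's
        exclusive copy must go back to the shared state."""
        if self.state == EXCLUSIVE:
            self.state = SHARED

    def snoop_invalidate(self):
        """Another core wrote the block: drop our copy."""
        self.state = INVALID
```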
Every bus transaction must check the cache-address tags, which could potentially interfere
with processor cache accesses. One way to reduce this interference is to duplicate the tags and
have snoop accesses directed to the duplicate tags. Another approach is to use a directory
at the shared L3 cache; the directory indicates whether a given block is shared and possibly
which cores have copies. With the directory information, invalidates can be directed only to
those caches with copies of the cache block. This requires that L3 must always have a copy of
any data item in L1 or L2, a property called inclusion, which we will return to in Section 5.7.
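The directory alternative can be sketched as a per-block sharer set kept alongside the inclusive L3. This is a hypothetical illustration (the `Directory` class and its methods are invented for this sketch), showing why invalidates need to go only to the caches that actually hold a copy.

```python
# Sketch of a directory at the shared L3: for each block it tracks which
# cores' private caches hold a copy, so an invalidate on a write can be
# directed only to those cores instead of broadcast to everyone.

class Directory:
    def __init__(self):
        self.sharers = {}                 # block address -> set of core ids

    def record_fetch(self, addr, core):
        """A core fetched the block into its private cache (inclusion
        guarantees L3 also holds it)."""
        self.sharers.setdefault(addr, set()).add(core)

    def invalidate_targets(self, addr, writer):
        """Cores other than the writer that must receive an invalidate."""
        return self.sharers.get(addr, set()) - {writer}
```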