placement hints are used. Such messages tell the controller that a block has been replaced.
Modify the directory coherence protocol of Section 5.4 to use such replacement hints.
5.28 [20/30] <5.4> One downside of a straightforward implementation of directories using
fully populated bit vectors is that the total size of the directory information scales as the
product (i.e., processor count × memory blocks). If memory is grown linearly with pro-
cessor count, the total size of the directory grows quadratically in the processor count. In
practice, because the directory needs only 1 bit per processor per memory block (which is typically 32 to
128 bytes), this problem is not serious for small to moderate processor counts. For example,
assuming a 128-byte block, the amount of directory storage compared to main memory is
the processor count/1024, or about 10% additional storage with 100 processors. This prob-
lem can be avoided by observing that we only need to keep an amount of information that
is proportional to the cache size of each processor. We explore some solutions in these ex-
ercises.
a. [20] <5.4> One method to obtain a scalable directory protocol is to organize the mul-
tiprocessor as a logical hierarchy with the processors as leaves of the hierarchy and
directories positioned at the root of each subtree. The directory at each subtree records
which descendants cache which memory blocks, as well as which memory blocks with
a home in that subtree are cached outside the subtree. Compute the amount of stor-
age needed to record the processor information for the directories, assuming that each
directory is fully associative. Your answer should account for both the number of nodes
at each level of the hierarchy and the total number of nodes.
b. [30] <5.4> An alternative approach to implementing directory schemes is to implement
bit vectors that are not dense. There are two strategies; one reduces the number of bit
vectors needed, and the other reduces the number of bits per vector. Using traces, you
can compare these schemes. First, implement the directory as a four-way set associ-
ative cache storing full bit vectors, but only for the blocks that are cached outside the
home node. If a directory cache miss occurs, choose a directory entry and invalidate
the entry. Second, implement the directory so that every entry has 8 bits. If a block
is cached in only one node outside its home, this field contains the node number. If
the block is cached in more than one node outside its home, this field is a bit vector,
with each bit indicating a group of eight processors, at least one of which caches the
block. Using traces of 64-processor execution, simulate the behavior of these schemes.
Assume a perfect cache for nonshared references so as to focus on coherency behavior.
Determine the number of extraneous invalidations as the directory cache size is
increased.
5.29 [10] <5.5> Implement the classical test-and-set instruction using the load-linked/store-con-
ditional instruction pair.
5.30 [15] <5.5> One performance optimization commonly used is to pad synchronization vari-
ables to not have any other useful data in the same cache line as the synchronization vari-
able. Construct a pathological example when not doing this can hurt performance. Assume
a snooping write invalidate protocol.
5.31 [30] <5.5> One possible implementation of the load-linked/store-conditional pair for mul-
ticore processors is to constrain these instructions to use uncached memory operations.
A monitor unit intercepts all reads and writes from any core to the memory. It keeps
track of the source of the load-linked instructions and whether any intervening stores occur
between the load-linked and its corresponding store-conditional instruction. The monitor can
prevent any failing store conditional from writing any data and can use the interconnect