signals to inform the processor that this store failed. Design such a monitor for a memory
system supporting a four-core symmetric multiprocessor (SMP). Take into account that,
generally, read and write requests can have different data sizes (4, 8, 16, 32 bytes). Any
memory location can be the target of a load-linked/store-conditional pair, and the memory
monitor should assume that load-linked/store-conditional references to any location can
be interleaved with regular accesses to the same location. The monitor complexity
should be independent of the memory size.
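A minimal software model of one possible monitor organization is sketched below: each core
holds a single reservation register (address, size, and a valid bit) that any conflicting store
clears, which is why the cost can be independent of memory size. This is only an illustrative
assumption about one design; the names (Reservation, ll_request, write_request, sc_request)
are invented for the sketch and are not part of the exercise.

/* Illustrative software model of an LL/SC monitor for a 4-core SMP.
 * One reservation register (address, size, valid bit) per core; any
 * conflicting store clears other cores' reservations. All names here
 * are hypothetical, chosen for this sketch only. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NCORES 4

typedef struct {
    uint64_t addr;    /* start address named by the last load-linked */
    unsigned size;    /* request size in bytes (4, 8, 16, or 32)     */
    bool     valid;   /* cleared when a conflicting store occurs     */
} Reservation;

static Reservation resv[NCORES];

/* Two accesses conflict if their byte ranges overlap. */
static bool overlaps(uint64_t a, unsigned alen, uint64_t b, unsigned blen) {
    return a < b + blen && b < a + alen;
}

/* Load-linked: set this core's reservation. */
static void ll_request(int core, uint64_t addr, unsigned size) {
    resv[core] = (Reservation){ .addr = addr, .size = size, .valid = true };
}

/* Any store (regular or a successful SC) invalidates overlapping
 * reservations held by other cores. */
static void write_request(int core, uint64_t addr, unsigned size) {
    for (int c = 0; c < NCORES; c++)
        if (c != core && resv[c].valid &&
            overlaps(resv[c].addr, resv[c].size, addr, size))
            resv[c].valid = false;
}

/* Store-conditional: succeeds only if this core's reservation is intact;
 * either way the reservation is consumed, and failure is reported back. */
static bool sc_request(int core, uint64_t addr, unsigned size) {
    bool ok = resv[core].valid &&
              overlaps(resv[core].addr, resv[core].size, addr, size);
    resv[core].valid = false;
    if (ok)
        write_request(core, addr, size);
    return ok;
}

int main(void) {
    ll_request(0, 0x1000, 8);     /* core 0: LL on 0x1000              */
    write_request(1, 0x1000, 4);  /* core 1: regular store, same range */
    printf("SC on core 0 %s\n",
           sc_request(0, 0x1000, 8) ? "succeeds" : "fails");
    return 0;
}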
5.32 [10/12/10/12] <5.6> As discussed in Section 5.6, the memory consistency model provides
a specification of how the memory system will appear to the programmer. Consider the
following code segment, where the initial values are A = flag = C = 0.

P1:                          P2:
A = 2000                     while (flag != 1) {;}
flag = 1                     C = A
a. [10] <5.6> At the end of the code segment, what is the value you would expect for C?
b. [12] <5.6> A system with a general-purpose interconnection network, a directory-
based cache coherence protocol, and support for nonblocking loads generates a result
where C is 0. Describe a scenario where this result is possible.
c. [10] <5.6> If you wanted to make the system sequentially consistent, what are the key
constraints you would need to impose?
d. [12] <5.6> Assume that a processor supports a relaxed memory consistency model. A relaxed
consistency model requires synchronization to be explicitly identified. Assume that the
processor supports a “barrier” instruction, which ensures that all memory operations preceding
the barrier instruction complete before any memory operations following the barrier are
allowed to begin. Where would you include barrier instructions in the above code segment to
ensure that you get the “intuitive results” of sequential consistency?
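For reference, the sketch below illustrates one conventional placement of such barriers in
producer/consumer code like the segment above, using C11 atomic fences and POSIX threads as
stand-ins for the hardware barrier instruction; the threading scaffolding and names are
assumptions of this sketch, not part of the exercise.

/* One possible barrier placement for the code segment above, modeled with
 * C11 fences; the pthread scaffolding is illustrative only. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int A = 0;
static atomic_int flag = 0;
static int C = 0;

static void *p1_thread(void *arg) {       /* plays the role of P1 */
    (void)arg;
    A = 2000;
    atomic_thread_fence(memory_order_release);   /* barrier after the data write */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

static void *p2_thread(void *arg) {       /* plays the role of P2 */
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_relaxed) != 1) { ; }
    atomic_thread_fence(memory_order_acquire);    /* barrier before the data read */
    C = A;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, p2_thread, NULL);
    pthread_create(&t1, NULL, p1_thread, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("C = %d\n", C);                 /* with both fences, C is always 2000 */
    return 0;
}

With the release fence in P1 and the acquire fence in P2, the write to A is guaranteed to
become visible before the write to flag, so C ends up as 2000.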
5.33 [25] <5.7> Prove that in a two-level cache hierarchy, where L1 is closer to the processor,
inclusion is maintained with no extra action if L2 has at least as much associativity as L1,
both caches use least recently used (LRU) replacement, and both caches have the same
block sizes.
5.34 [Discussion] <5.7> When trying to perform detailed performance evaluation of a
multiprocessor system, system designers use one of three tools: analytical models, trace-driven
simulation, and execution-driven simulation. Analytical models use mathematical expressions
to model the behavior of programs. Trace-driven simulations run the applications on a real
machine and generate a trace, typically of memory operations. These traces can be replayed
through a cache simulator or a simulator with a simple processor model to predict the
performance of the system when various parameters are changed. Execution-driven simulators
simulate the entire execution, maintaining an equivalent structure for the processor state and
so on. What are the accuracy and speed trade-offs between these approaches?
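To make “replayed through a cache simulator” concrete, the fragment below sketches the core
loop of a trace-driven simulation: a pre-recorded address trace is pushed through a
direct-mapped cache model and misses are counted. The trace, cache parameters, and names are
illustrative assumptions, not a description of any particular tool.

/* Minimal trace-driven simulation loop: replay recorded addresses through a
 * direct-mapped cache model and count misses. Parameters are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64      /* assumed block size           */
#define NUM_SETS   1024    /* assumed number of cache sets */

int main(void) {
    /* A tiny stand-in for a trace captured on a real machine. */
    uint64_t trace[] = { 0x1000, 0x1008, 0x2000, 0x1000, 0x2040, 0x1008 };
    size_t n = sizeof trace / sizeof trace[0];

    uint64_t tags[NUM_SETS];
    int valid[NUM_SETS] = { 0 };
    long misses = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t block = trace[i] / LINE_BYTES;   /* block address           */
        uint64_t set   = block % NUM_SETS;        /* direct-mapped index     */
        if (!valid[set] || tags[set] != block) {  /* miss: fill the line     */
            misses++;
            valid[set] = 1;
            tags[set]  = block;
        }
    }
    printf("accesses = %zu, misses = %ld, miss rate = %.2f\n",
           n, misses, (double)misses / n);
    return 0;
}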
5.35 [40] <5.7, 5.9> Multiprocessors and clusters usually show performance increases as you
increase the number of processors, with the ideal being n × speedup for n processors.
The goal of this biased benchmark is to make a program that gets worse performance as
you add processors. This means, for example, that one processor on the multiprocessor or
cluster runs the program fastest, two are slower, four are slower than two, and so on. What