signals to inform the processor that this store failed. Design such a monitor for a memory
system supporting a four-core symmetric multiprocessor (SMP). Take into account that,
generally, read and write requests can have different data sizes (4, 8, 16, 32 bytes). Any
memory location can be the target of a load-linked/store-conditional pair, and the memory
monitor should assume that load-linked/store-conditional references to any location can
be interleaved with regular accesses to the same location. The monitor complexity
should be independent of the memory size.
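A minimal software model of one possible monitor organization is sketched below: each core
holds a single reservation register (address, size, and a valid bit) that any conflicting store
clears, which is why the cost can be independent of memory size. This is only an illustrative
assumption about one design; the names (Reservation, ll_request, write_request, sc_request)
are invented for the sketch and are not part of the exercise.

/* Illustrative software model of an LL/SC monitor for a 4-core SMP.
 * One reservation register (address, size, valid bit) per core; any
 * conflicting store clears other cores' reservations. All names here
 * are hypothetical, chosen for this sketch only. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NCORES 4

typedef struct {
    uint64_t addr;    /* start address named by the last load-linked */
    unsigned size;    /* request size in bytes (4, 8, 16, or 32)     */
    bool     valid;   /* cleared when a conflicting store occurs     */
} Reservation;

static Reservation resv[NCORES];

/* Two accesses conflict if their byte ranges overlap. */
static bool overlaps(uint64_t a, unsigned alen, uint64_t b, unsigned blen) {
    return a < b + blen && b < a + alen;
}

/* Load-linked: set this core's reservation. */
static void ll_request(int core, uint64_t addr, unsigned size) {
    resv[core] = (Reservation){ .addr = addr, .size = size, .valid = true };
}

/* Any store (regular or a successful SC) invalidates overlapping
 * reservations held by other cores. */
static void write_request(int core, uint64_t addr, unsigned size) {
    for (int c = 0; c < NCORES; c++)
        if (c != core && resv[c].valid &&
            overlaps(resv[c].addr, resv[c].size, addr, size))
            resv[c].valid = false;
}

/* Store-conditional: succeeds only if this core's reservation is intact;
 * either way the reservation is consumed, and failure is reported back. */
static bool sc_request(int core, uint64_t addr, unsigned size) {
    bool ok = resv[core].valid &&
              overlaps(resv[core].addr, resv[core].size, addr, size);
    resv[core].valid = false;
    if (ok)
        write_request(core, addr, size);
    return ok;
}

int main(void) {
    ll_request(0, 0x1000, 8);     /* core 0: LL on 0x1000              */
    write_request(1, 0x1000, 4);  /* core 1: regular store, same range */
    printf("SC on core 0 %s\n",
           sc_request(0, 0x1000, 8) ? "succeeds" : "fails");
    return 0;
}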
5.32 [10/12/10/12] <5.6> As discussed in Section 5.6, the memory consistency model provides
a specification of how the memory system will appear to the programmer. Consider the
following code segment, where the initial values are A = flag = C = 0.

P1:                          P2:
A = 2000                     while (flag != 1) {;}
flag = 1                     C = A
a. [10] <5.6> At the end of the code segment, what is the value you would expect for C?
b. [12] <5.6> A system with a general-purpose interconnection network, a directory-
based cache coherence protocol, and support for nonblocking loads generates a result
where C is 0. Describe a scenario where this result is possible.
c. [10] <5.6> If you wanted to make the system sequentially consistent, what are the key
constraints you would need to impose?
d. [12] <5.6> Assume that a processor supports a relaxed memory consistency model. A relaxed
consistency model requires synchronization to be explicitly identified. Assume that the
processor supports a “barrier” instruction, which ensures that all memory operations preceding
the barrier instruction complete before any memory operations following the barrier are
allowed to begin. Where would you include barrier instructions in the above code segment to
ensure that you get the “intuitive results” of sequential consistency?
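For reference, the sketch below illustrates one conventional placement of such barriers in
producer/consumer code like the segment above, using C11 atomic fences and POSIX threads as
stand-ins for the hardware barrier instruction; the threading scaffolding and names are
assumptions of this sketch, not part of the exercise.

/* One possible barrier placement for the code segment above, modeled with
 * C11 fences; the pthread scaffolding is illustrative only. */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int A = 0;
static atomic_int flag = 0;
static int C = 0;

static void *p1_thread(void *arg) {       /* plays the role of P1 */
    (void)arg;
    A = 2000;
    atomic_thread_fence(memory_order_release);   /* barrier after the data write */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

static void *p2_thread(void *arg) {       /* plays the role of P2 */
    (void)arg;
    while (atomic_load_explicit(&flag, memory_order_relaxed) != 1) { ; }
    atomic_thread_fence(memory_order_acquire);    /* barrier before the data read */
    C = A;
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, p2_thread, NULL);
    pthread_create(&t1, NULL, p1_thread, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("C = %d\n", C);                 /* with both fences, C is always 2000 */
    return 0;
}

With the release fence in P1 and the acquire fence in P2, the write to A is guaranteed to
become visible before the write to flag, so C ends up as 2000.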
5.33 [25] <5.7> Prove that in a two-level cache hierarchy, where L1 is closer to the processor,
inclusion is maintained with no extra action if L2 has at least as much associativity as L1,
both caches use least recently used (LRU) replacement, and both caches have the same
block sizes.
5.34 [Discussion] <5.7> When trying to perform detailed performance evaluation of a
multiprocessor system, system designers use one of three tools: analytical models, trace-driven
simulation, and execution-driven simulation. Analytical models use mathematical expressions
to model the behavior of programs. Trace-driven simulations run the applications on a real
machine and generate a trace, typically of memory operations. These traces can be replayed
through a cache simulator or a simulator with a simple processor model to predict the
performance of the system when various parameters are changed. Execution-driven simulators
simulate the entire execution, maintaining an equivalent structure for the processor state and
so on. What are the accuracy and speed trade-offs between these approaches?
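To make “replayed through a cache simulator” concrete, the fragment below sketches the core
loop of a trace-driven simulation: a pre-recorded address trace is pushed through a
direct-mapped cache model and misses are counted. The trace, cache parameters, and names are
illustrative assumptions, not a description of any particular tool.

/* Minimal trace-driven simulation loop: replay recorded addresses through a
 * direct-mapped cache model and count misses. Parameters are illustrative. */
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64      /* assumed block size           */
#define NUM_SETS   1024    /* assumed number of cache sets */

int main(void) {
    /* A tiny stand-in for a trace captured on a real machine. */
    uint64_t trace[] = { 0x1000, 0x1008, 0x2000, 0x1000, 0x2040, 0x1008 };
    size_t n = sizeof trace / sizeof trace[0];

    uint64_t tags[NUM_SETS];
    int valid[NUM_SETS] = { 0 };
    long misses = 0;

    for (size_t i = 0; i < n; i++) {
        uint64_t block = trace[i] / LINE_BYTES;   /* block address           */
        uint64_t set   = block % NUM_SETS;        /* direct-mapped index     */
        if (!valid[set] || tags[set] != block) {  /* miss: fill the line     */
            misses++;
            valid[set] = 1;
            tags[set]  = block;
        }
    }
    printf("accesses = %zu, misses = %ld, miss rate = %.2f\n",
           n, misses, (double)misses / n);
    return 0;
}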
5.35 [40] <5.7, 5.9> Multiprocessors and clusters usually show performance increases as you
increase the number of processors, with the ideal being n × speedup for n processors.
The goal of this biased benchmark is to make a program that gets worse performance as
you add processors. This means, for example, that one processor on the multiprocessor or
cluster runs the program fastest, two are slower, four are slower than two, and so on. What