Suppose we have a processor where a write miss takes 50 cycles to establish
ownership, 10 cycles to issue each invalidate after ownership is established, and
80 cycles for an invalidate to complete and be acknowledged once it is issued.
Assuming that four other processors share a cache block, how long does a write
miss stall the writing processor if the processor is sequentially consistent?
Assume that the invalidates must be explicitly acknowledged before the coherence
controller knows they are completed. Suppose we could continue executing
after obtaining ownership for the write miss without waiting for the invalidates;
how long would the write take?
Answer
When we wait for invalidates, each write takes the sum of the ownership time
plus the time to complete the invalidates. Since the invalidates can overlap, we
need only worry about the last one, which starts 10 + 10 + 10 + 10 = 40 cycles after
ownership is established. Hence, the total time for the write is 50 + 40 + 80 = 170
cycles. In comparison, the ownership time is only 50 cycles. With appropriate
write buffer implementations, it is even possible to continue before ownership
is established.
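The cycle counts in the answer can be sketched as a small calculation (the constant names below are illustrative, not from the text):

```python
# Sketch of the write-miss latency calculation above (illustrative only).
OWNERSHIP = 50   # cycles to establish ownership
ISSUE = 10       # cycles to issue each invalidate
ACK = 80         # cycles for an invalidate to complete and be acknowledged
SHARERS = 4      # other processors sharing the cache block

# Invalidate issues are serialized, so the last invalidate is issued
# SHARERS * ISSUE cycles after ownership is established.
last_issue = SHARERS * ISSUE             # 40 cycles

# Sequentially consistent: stall until the last acknowledgment arrives.
sc_stall = OWNERSHIP + last_issue + ACK  # 50 + 40 + 80 = 170 cycles

# Continuing after ownership: only the ownership time is exposed.
relaxed_stall = OWNERSHIP                # 50 cycles

print(sc_stall, relaxed_stall)           # 170 50
```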
To provide better performance, researchers and architects have explored two different
routes. First, they developed ambitious implementations that preserve sequential consistency
but use latency-hiding techniques to reduce the penalty; we discuss these in Section 5.7. Second,
they developed less restrictive memory consistency models that allow for faster hardware.
Such models can affect how the programmer sees the multiprocessor, so before we discuss
these less restrictive models, let's look at what the programmer expects.
The Programmer's View
Although the sequential consistency model has a performance disadvantage, from the view-
point of the programmer it has the advantage of simplicity. The challenge is to develop a pro-
gramming model that is simple to explain and yet allows a high-performance implementation.
One such programming model that allows us to have a more efficient implementation is to
assume that programs are synchronized . A program is synchronized if all accesses to shared
data are ordered by synchronization operations. A data reference is ordered by a synchroniz-
ation operation if, in every possible execution, a write of a variable by one processor and an
access (either a read or a write) of that variable by another processor are separated by a pair
of synchronization operations, one executed after the write by the writing processor and one
executed before the access by the second processor. Cases where variables may be updated
without ordering by synchronization are called data races because the execution outcome de-
pends on the relative speed of the processors, and, like races in hardware design, the outcome
is unpredictable, which leads to another name for synchronized programs: data-race-free .
As a simple example, consider a variable being read and updated by two different pro-
cessors. Each processor surrounds the read and update with a lock and an unlock, both to en-
sure mutual exclusion for the update and to ensure that the read is consistent. Clearly, every
write is now separated from a read by the other processor by a pair of synchronization
operations: the unlock executed after the write and the lock executed before the read.
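This lock/unlock pattern can be sketched with Python's `threading` module (the variable and function names here are illustrative, not from the text):

```python
import threading

lock = threading.Lock()  # acquire = lock, release = unlock
shared = 0               # variable read and updated by both threads

def reader_updater(n):
    """Every access to `shared` is bracketed by lock/unlock, so a write
    by one thread and an access by the other are always separated by a
    release (after the write) and an acquire (before the access)."""
    global shared
    for _ in range(n):
        with lock:             # lock before the access
            value = shared     # read is consistent under mutual exclusion
            shared = value + 1
            # unlock after the update (end of the `with` block)

threads = [threading.Thread(target=reader_updater, args=(10_000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared)  # 20000: the program is data-race-free, so the outcome is deterministic
```

Without the lock, the read-modify-write of `shared` by the two threads would constitute a data race, and the final value would depend on how the threads interleave.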