Suppose we have a processor where a write miss takes 50 cycles to establish
ownership, 10 cycles to issue each invalidate after ownership is established, and
80 cycles for an invalidate to complete and be acknowledged once it is issued.
Assuming that four other processors share a cache block, how long does a write
miss stall the writing processor if the processor is sequentially consistent?
Assume that the invalidates must be explicitly acknowledged before the coherence
controller knows they are completed. Suppose we could continue executing
after obtaining ownership for the write miss without waiting for the invalidates;
how long would the write take?
Answer
When we wait for invalidates, each write takes the sum of the ownership time
plus the time to complete the invalidates. Since the invalidates can overlap, we
need only worry about the last one, which starts 10 + 10 + 10 + 10 = 40 cycles after
ownership is established. Hence, the total time for the write is 50 + 40 + 80 = 170
cycles. In comparison, the ownership time is only 50 cycles. With appropriate
write buffer implementations, it is even possible to continue before ownership
is established.
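The cycle counts in the answer can be sketched as a small calculation (the constant names below are illustrative, not from the text):

```python
# Sketch of the write-miss latency calculation above (illustrative only).
OWNERSHIP = 50   # cycles to establish ownership
ISSUE = 10       # cycles to issue each invalidate
ACK = 80         # cycles for an invalidate to complete and be acknowledged
SHARERS = 4      # other processors sharing the cache block

# Invalidate issues are serialized, so the last invalidate is issued
# SHARERS * ISSUE cycles after ownership is established.
last_issue = SHARERS * ISSUE             # 40 cycles

# Sequentially consistent: stall until the last acknowledgment arrives.
sc_stall = OWNERSHIP + last_issue + ACK  # 50 + 40 + 80 = 170 cycles

# Continuing after ownership: only the ownership time is exposed.
relaxed_stall = OWNERSHIP                # 50 cycles

print(sc_stall, relaxed_stall)           # 170 50
```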
To provide better performance, researchers and architects have explored two different
routes. First, they developed ambitious implementations that preserve sequential consistency
but use latency-hiding techniques to reduce the penalty; we discuss these in Section 5.7. Second,
they developed less restrictive memory consistency models that allow for faster hardware.
Such models can affect how the programmer sees the multiprocessor, so before we discuss
these less restrictive models, let's look at what the programmer expects.
The Programmer's View
Although the sequential consistency model has a performance disadvantage, from the view-
point of the programmer it has the advantage of simplicity. The challenge is to develop a pro-
gramming model that is simple to explain and yet allows a high-performance implementation.
One such programming model that allows us to have a more efficient implementation is to
assume that programs are synchronized . A program is synchronized if all accesses to shared
data are ordered by synchronization operations. A data reference is ordered by a synchroniz-
ation operation if, in every possible execution, a write of a variable by one processor and an
access (either a read or a write) of that variable by another processor are separated by a pair
of synchronization operations, one executed after the write by the writing processor and one
executed before the access by the second processor. Cases where variables may be updated
without ordering by synchronization are called data races because the execution outcome de-
pends on the relative speed of the processors, and, like races in hardware design, the outcome
is unpredictable, which leads to another name for synchronized programs: data-race-free .
As a simple example, consider a variable being read and updated by two different pro-
cessors. Each processor surrounds the read and update with a lock and an unlock, both to en-
sure mutual exclusion for the update and to ensure that the read is consistent. Clearly, every
write is now separated from a read by the other processor by a pair of synchronization
operations: the unlock executed after the write and the lock executed before the read.
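This lock/unlock pattern can be sketched with Python's `threading` module (the variable and function names here are illustrative, not from the text):

```python
import threading

lock = threading.Lock()  # acquire = lock, release = unlock
shared = 0               # variable read and updated by both threads

def reader_updater(n):
    """Every access to `shared` is bracketed by lock/unlock, so a write
    by one thread and an access by the other are always separated by a
    release (after the write) and an acquire (before the access)."""
    global shared
    for _ in range(n):
        with lock:             # lock before the access
            value = shared     # read is consistent under mutual exclusion
            shared = value + 1
            # unlock after the update (end of the `with` block)

threads = [threading.Thread(target=reader_updater, args=(10_000,))
           for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(shared)  # 20000: the program is data-race-free, so the outcome is deterministic
```

Without the lock, the read-modify-write of `shared` by the two threads would constitute a data race, and the final value would depend on how the threads interleave.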