lock. Here is the code sequence to lock a spin lock whose address is in R1 using an atomic ex-
change:
        DADDUI R2,R0,#1
lockit: EXCH   R2,0(R1)   ;atomic exchange
        BNEZ   R2,lockit  ;already locked?
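As a concrete sketch, the same exchange-based lock can be written with C11 atomics; the `spinlock` type and function names here are our own illustration, not from the text:

```c
#include <stdatomic.h>

/* 0 = unlocked, 1 = locked, matching the assembly above. */
typedef struct { atomic_int val; } spinlock;

/* The DADDUI/EXCH/BNEZ loop: keep swapping in a 1 until the old value is 0. */
static void spin_lock(spinlock *l) {
    while (atomic_exchange(&l->val, 1) != 0)
        ;  /* spin; note that every iteration performs a write */
}

static void spin_unlock(spinlock *l) {
    atomic_store(&l->val, 0);  /* store 0 to release */
}
```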
If our multiprocessor supports cache coherence, we can cache the locks using the coherence
mechanism to maintain the lock value coherently. Caching locks has two advantages. First, it
allows an implementation where the process of “spinning” (trying to test and acquire the lock
in a tight loop) could be done on a local cached copy rather than requiring a global memory
access on each attempt to acquire the lock. The second advantage comes from the observation
that there is often locality in lock accesses; that is, the processor that used the lock last will use
it again in the near future. In such cases, the lock value may reside in the cache of that
processor, greatly reducing the time to acquire the lock.
Obtaining the first advantage—being able to spin on a local cached copy rather than
generating a memory request for each attempt to acquire the lock—requires a change in our simple
spin procedure. Each attempt to exchange in the loop directly above requires a write operation.
If multiple processors are attempting to get the lock, each will generate the write. Most of
these writes will lead to write misses, since each processor is trying to obtain the lock variable
in an exclusive state.
Thus, we should modify our spin lock procedure so that it spins by doing reads on a local
copy of the lock until it successfully sees that the lock is available. Then it attempts to acquire
the lock by doing a swap operation. A processor first reads the lock variable to test its state.
A processor keeps reading and testing until the value of the read indicates that the lock is un-
locked. The processor then races against all other processes that were similarly “spin waiting”
to see who can lock the variable first. All processes use a swap instruction that reads the old
value and stores a 1 into the lock variable. The single winner will see the 0, and the losers will
see a 1 that was placed there by the winner. (The losers will continue to set the variable to the
locked value, but that doesn't matter.) The winning processor executes the code after the lock
and, when finished, stores a 0 into the lock variable to release the lock, which starts the race
all over again. Here is the code to perform this spin lock (remember that 0 is unlocked and 1 is
locked):
lockit: LD     R2,0(R1)   ;load of lock
        BNEZ   R2,lockit  ;not available-spin
        DADDUI R2,R0,#1   ;load locked value
        EXCH   R2,0(R1)   ;swap
        BNEZ   R2,lockit  ;branch if lock wasn't 0
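A C11 sketch of this "test-and-test-and-set" loop, again with illustrative names of our own:

```c
#include <stdatomic.h>

typedef struct { atomic_int val; } spinlock;  /* 0 = unlocked, 1 = locked */

static void spin_lock(spinlock *l) {
    for (;;) {
        /* LD/BNEZ: read-only spin; while the lock is held, these reads
           hit in the local cache instead of generating bus traffic */
        while (atomic_load(&l->val) != 0)
            ;
        /* DADDUI/EXCH/BNEZ: race to swap in a 1; the one winner sees the 0 */
        if (atomic_exchange(&l->val, 1) == 0)
            return;  /* acquired */
        /* a loser saw a 1 and goes back to read-only spinning */
    }
}

static void spin_unlock(spinlock *l) {
    atomic_store(&l->val, 0);  /* release: restart the race */
}
```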
Let's examine how this “spin lock” scheme uses the cache coherence mechanisms. Figure
5.24 shows the processor and bus or directory operations for multiple processes trying to lock
a variable using an atomic swap. Once the processor with the lock stores a 0 into the lock, all
other caches are invalidated and must fetch the new value to update their copy of the lock.
One such cache gets the copy of the unlocked value (0) first and performs the swap. When the
cache miss of other processors is satisfied, they find that the variable is already locked, so they
must return to testing and spinning.
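To see this race play out, a small (hypothetical) pthreads driver can hammer such a lock from several threads; because only one winner at a time enters the critical section, the shared counter ends up exact:

```c
#include <pthread.h>
#include <stdatomic.h>

enum { NTHREADS = 4, NITERS = 10000 };

static atomic_int lockvar;  /* 0 = unlocked, 1 = locked */
static long counter;        /* shared data protected by lockvar */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        for (;;) {                              /* test-and-test-and-set */
            while (atomic_load(&lockvar) != 0)
                ;                               /* spin on cached copy */
            if (atomic_exchange(&lockvar, 1) == 0)
                break;                          /* won the race */
        }
        counter++;                              /* critical section */
        atomic_store(&lockvar, 0);              /* release the lock */
    }
    return NULL;
}
```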