Thread-Level Parallelism - Computer Architecture: A Quantitative Approach


When the block is in the exclusive state, the current value of the block is held in a cache on the

node identified by the set Sharers (the owner), so there are three possible directory requests:

■ Read miss —The owner is sent a data fetch message, which causes the state of the block in

the owner's cache to transition to shared and causes the owner to send the data to the

directory, where it is written to memory and sent back to the requesting processor. The

identity of the requesting node is added to the set Sharers, which still contains the identity of

the processor that was the owner (since it still has a readable copy).

■ Data write-back —The owner is replacing the block and therefore must write it back. This

write-back makes the memory copy up to date (the home directory essentially becomes the

owner), the block is now uncached, and the Sharers set is empty.

■ Write miss —The block has a new owner. A message is sent to the old owner, causing the

cache to invalidate the block and send the value to the directory, from which it is sent to

the requesting node, which becomes the new owner. Sharers is set to the identity of the

new owner, and the state of the block remains exclusive.
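
The three exclusive-state transitions above can be sketched as a small state machine. This is a minimal illustration, not the book's implementation: the names DirectoryEntry and handle_exclusive_request, the message labels, and the use of a single integer as the block's value are all assumptions made for this example. It also assumes the directory marks the block shared after a read miss, matching the owner's cache transition described above.

```python
from dataclasses import dataclass, field

# Directory states for one memory block (labels are illustrative).
UNCACHED, SHARED, EXCLUSIVE = "uncached", "shared", "exclusive"


@dataclass
class DirectoryEntry:
    """Hypothetical directory entry: state, sharer set, and the memory copy."""
    state: str = UNCACHED
    sharers: set = field(default_factory=set)
    memory_value: int = 0


def handle_exclusive_request(entry, request, requester, owner_value):
    """Apply one request to a block in the exclusive state.

    owner_value stands in for the up-to-date data held in the owner's cache.
    Returns the (message, destination) pairs the directory would send.
    """
    if entry.state != EXCLUSIVE or len(entry.sharers) != 1:
        raise ValueError("block must be exclusive with exactly one owner")
    (owner,) = entry.sharers

    if request == "read_miss":
        # Owner downgrades to shared and returns the data; memory is updated
        # and the requester joins the owner in the sharer set.
        entry.memory_value = owner_value
        entry.sharers.add(requester)
        entry.state = SHARED
        return [("fetch", owner), ("data_value_reply", requester)]

    if request == "data_write_back":
        # Owner is replacing the block: memory becomes up to date and
        # no cache holds the block any longer.
        entry.memory_value = owner_value
        entry.sharers.clear()
        entry.state = UNCACHED
        return []

    if request == "write_miss":
        # Old owner invalidates and returns the data; the requester becomes
        # the sole owner and the block stays exclusive.
        entry.memory_value = owner_value
        entry.sharers = {requester}
        return [("fetch_invalidate", owner), ("data_value_reply", requester)]

    raise ValueError(f"unknown request: {request}")
```

For example, starting from an entry owned by node 1, a read miss from node 2 leaves the block shared with Sharers = {1, 2}, while a write miss from node 3 leaves it exclusive with Sharers = {3}.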

The state transition diagram in Figure 5.23 is a simplification, just as it was in the snooping

cache case. In the case of a directory, as well as a snooping scheme implemented with a

network other than a bus, our protocols will need to deal with nonatomic memory transactions.

Appendix I explores these issues in depth.

The directory protocols used in real multiprocessors contain additional optimizations. In

particular, in this protocol when a read or write miss occurs for a block that is exclusive, the

block is first sent to the directory at the home node. From there it is stored into the home

memory and also sent to the original requesting node. Many of the protocols in use in

commercial multiprocessors forward the data from the owner node to the requesting node directly

(as well as performing the write-back to the home). Such optimizations often add complexity

by increasing the possibility of deadlock and by increasing the types of messages that must be

handled.

Implementing a directory scheme requires solving most of the same challenges we

discussed for snooping protocols beginning on page 365. There are, however, new and additional

problems, which we describe in Appendix I. In Section 5.8, we briefly describe how modern

multicores extend coherence beyond a single chip. The combinations of multichip coherence

and multicore coherence include all four possibilities of snooping/snooping (AMD Opteron),

snooping/directory, directory/snooping, and directory/directory!

5.5 Synchronization: The Basics

Synchronization mechanisms are typically built with user-level software routines that rely on

hardware-supplied synchronization instructions. For smaller multiprocessors or low-contention

situations, the key hardware capability is an uninterruptible instruction or instruction

sequence capable of atomically retrieving and changing a value. Software synchronization

mechanisms are then constructed using this capability. In this section, we focus on the

implementation of lock and unlock synchronization operations. Lock and unlock can be used

straightforwardly to create mutual exclusion, as well as to implement more complex

synchronization mechanisms.
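
As a rough sketch of how lock and unlock can be layered on an operation that atomically retrieves and changes a value, the example below builds a spin lock on an atomic exchange. Atomic exchange is assumed here as one primitive with that property. Python exposes no such hardware instruction, so the hypothetical AtomicCell class models it in software with an internal threading.Lock that stands in only for the hardware's atomicity guarantee. The names AtomicCell, SpinLock, and run_counter_demo are illustrative.

```python
import threading
import time


class AtomicCell:
    """Software stand-in for a memory word with an atomic exchange operation."""

    def __init__(self, value=0):
        self._value = value
        # Models the hardware's atomicity only; it is not the lock being built.
        self._guard = threading.Lock()

    def exchange(self, new_value):
        """Atomically store new_value and return the previous value."""
        with self._guard:
            old = self._value
            self._value = new_value
            return old


class SpinLock:
    """Lock built on atomic exchange: 0 means free, 1 means held."""

    def __init__(self):
        self._cell = AtomicCell(0)

    def lock(self):
        # Swap in 1; reading back 1 means another thread already holds the lock.
        while self._cell.exchange(1) == 1:
            time.sleep(0)  # give other threads a chance to run while spinning

    def unlock(self):
        # Release by storing 0.
        self._cell.exchange(0)


def run_counter_demo(num_threads=4, increments=200):
    """Increment a shared counter under the spin lock and return the total."""
    lock = SpinLock()
    total = [0]

    def worker():
        for _ in range(increments):
            lock.lock()
            total[0] += 1  # critical section protected by mutual exclusion
            lock.unlock()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total[0]
```

Calling run_counter_demo(4, 200) should return 800, since each of the four threads performs 200 increments inside the critical section. Every spinning thread repeatedly exchanges on the same word, which is why a simple lock like this suits only the low-contention case.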

In high-contention situations, synchronization can become a performance bottleneck because

contention introduces additional delays and because latency is potentially greater in

such a multiprocessor. We discuss how the basic synchronization mechanisms of this section

can be extended for large processor counts in Appendix I.
