64-byte (512-bit) cache line implies an overhead of over 50 percent. A third possibility is to keep one 8-bit field in each directory entry and use it as the head of a linked list that threads all the copies of the cache line together. This strategy requires extra storage at each node for the linked list pointers, and it also requires following a linked list to find all the copies when that is needed. Each possibility has its own advantages and disadvantages, and all three have been used in real systems.
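To make the storage trade-off concrete, the C sketch below declares per-line directory entries for the three formats, assuming a 256-node machine, 8-bit node identifiers, and that the two alternatives alluded to above are a full bit map and a small table of explicit pointers. The field widths (and the four-slot limited-pointer variant) are illustrative assumptions, not the layout of any real design.

#include <stdint.h>
#include <stdio.h>

#define NUM_NODES 256                  /* assumed machine size */

/* (a) Full bit map: one presence bit per node.  With 256 nodes this is
 *     32 bytes per 64-byte cache line, which is where an overhead of
 *     roughly 50 percent comes from. */
struct dir_bitmap {
    uint8_t present[NUM_NODES / 8];    /* bit i set => node i has a copy */
    uint8_t dirty;
};

/* (b) Limited pointers: room for a small, fixed number of sharers. */
struct dir_limited {
    uint8_t sharer[4];                 /* node IDs of up to four copies */
    uint8_t count;
    uint8_t dirty;
};

/* (c) Linked list: the directory stores only the head of the list; each
 *     caching node keeps a "next sharer" field next to its copy of the
 *     line, threading all the copies together. */
struct dir_list {
    uint8_t head;                      /* node ID of first copy (or a NIL value) */
    uint8_t dirty;
};

int main(void) {
    printf("bit map: %zu bytes, limited pointers: %zu bytes, list head: %zu bytes\n",
           sizeof(struct dir_bitmap), sizeof(struct dir_limited),
           sizeof(struct dir_list));
    return 0;
}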
Another improvement to the directory design is to keep track of whether the cache line is clean (home memory is up to date) or dirty (home memory is not up to date). If a read request comes in for a clean cache line, the home node can satisfy the request from memory, without having to forward it to a cache. A read request for a dirty cache line, however, must be forwarded to the node holding the cache line because only it has a valid copy. If only one cache copy is allowed, as in Fig. 8-33, there is no real advantage to keeping track of its cleanliness, because any new request requires a message to be sent to the existing copy to invalidate it.
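The clean/dirty distinction amounts to a small piece of bookkeeping at the home node. The C sketch below shows one way a read request might be handled under the single-copy assumption of Fig. 8-33; the structure fields and the stub functions standing in for interconnect messages are hypothetical names, not part of any particular machine's protocol.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct dir_entry {
    int  owner;     /* node holding the single cached copy (if any) */
    bool cached;    /* is there a remote copy at all?               */
    bool dirty;     /* is home memory out of date?                  */
};

/* Stubs standing in for the interconnect; a real machine would send
 * messages to other nodes here. */
static void send_data_from_memory(int requester, uint64_t line) {
    printf("reply to node %d with line %llu from home memory\n",
           requester, (unsigned long long)line);
}
static void forward_read_to_owner(int owner, int requester, uint64_t line) {
    printf("forward read of line %llu to owner %d on behalf of node %d\n",
           (unsigned long long)line, owner, requester);
}

static void handle_read_request(struct dir_entry *e, int requester, uint64_t line) {
    if (!e->cached || !e->dirty) {
        /* Clean (or uncached) line: home memory is up to date, so the
         * home node satisfies the request itself. */
        send_data_from_memory(requester, line);
    } else {
        /* Dirty line: only the owner holds valid data, so the request
         * must be forwarded to it. */
        forward_read_to_owner(e->owner, requester, line);
    }
}

int main(void) {
    struct dir_entry clean = { .owner = 0, .cached = false, .dirty = false };
    struct dir_entry dirty = { .owner = 7, .cached = true,  .dirty = true  };
    handle_read_request(&clean, 3, 42);   /* answered from home memory */
    handle_read_request(&dirty, 3, 42);   /* forwarded to node 7       */
    return 0;
}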
Of course, keeping track of whether each cache line is clean or dirty implies that when a cache line is modified, the home node has to be informed, even if only one cache copy exists. If multiple copies exist, modifying one of them requires the rest to be invalidated, so some protocol is needed to avoid race conditions. For example, to modify a shared cache line, one of the holders might have to request exclusive access before modifying it. Such a request would cause all other copies to be invalidated before permission was granted. Other performance optimizations for CC-NUMA machines are discussed in Cheng and Carter (2008).
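One possible shape of such an exclusive-access request, as seen from the home node, is sketched below in C, assuming a bit-map directory. The helper functions and the omitted acknowledgement handling are illustrative assumptions, not the protocol of any specific machine.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_NODES 256              /* assumed machine size */

struct dir_entry {
    bool present[NUM_NODES];       /* which nodes currently hold a copy */
    bool dirty;                    /* home memory out of date?          */
    int  owner;                    /* meaningful only when dirty        */
};

/* Stubs standing in for interconnect messages. */
static void send_invalidate(int node, uint64_t line) {
    printf("invalidate line %llu at node %d\n", (unsigned long long)line, node);
}
static void grant_exclusive(int requester, uint64_t line) {
    printf("grant exclusive copy of line %llu to node %d\n",
           (unsigned long long)line, requester);
}

/* Home-node handling of a request for exclusive (write) access.  All other
 * copies are invalidated before permission is granted, so two nodes can
 * never both believe they own the line. */
static void handle_write_request(struct dir_entry *e, int requester, uint64_t line) {
    for (int node = 0; node < NUM_NODES; node++) {
        if (e->present[node] && node != requester) {
            send_invalidate(node, line);
            e->present[node] = false;
        }
    }
    /* A real protocol would wait for invalidation acknowledgements here
     * before replying, to avoid the race conditions mentioned above. */
    e->present[requester] = true;
    e->dirty = true;
    e->owner = requester;
    grant_exclusive(requester, line);
}

int main(void) {
    struct dir_entry e = { .dirty = false, .owner = -1 };
    e.present[3] = e.present[9] = true;    /* line shared by nodes 3 and 9 */
    handle_write_request(&e, 3, 42);       /* node 3 asks to write line 42 */
    return 0;
}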
The Sun Fire E25K NUMA Multiprocessor
As an example of a shared-memory NUMA multiprocessor, let us study the Sun Microsystems Sun Fire family. Although it contains various models, we will focus on the E25K, which has 72 UltraSPARC IV CPU chips. An UltraSPARC IV is essentially a pair of UltraSPARC III processors that share a common cache and memory. The E15K is essentially the same system except with uniprocessor instead of dual-processor CPU chips. Smaller members exist as well, but from our point of view, what is interesting is how the one with the most CPUs works.
The E25K system consists of up to 18 boardsets, each boardset consisting of a CPU-memory board, an I/O board with four PCI slots, and an expander board that couples the CPU-memory board with the I/O board and joins the pair to the centerplane, which holds the boards and contains the switching logic. Each CPU-memory board contains four CPU chips and four 8-GB RAM modules. Consequently, each CPU-memory board on the E25K holds eight CPUs and 32 GB of RAM (four CPUs and 32 GB of RAM on the E15K). A full E25K thus contains 144 CPUs, 576 GB of RAM, and 72 PCI slots. It is illustrated in Fig. 8-34. Interestingly enough, the number 18 was chosen due to packaging constraints: a system with 18 boardsets was the largest one that could fit through a doorway in one piece.
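The totals quoted above follow directly from the per-boardset figures. The small C program below simply redoes that arithmetic, taking 18 boardsets with four dual-processor chips, four 8-GB RAM modules, and four PCI slots each.

#include <stdio.h>

int main(void) {
    const int boardsets         = 18;  /* maximum in a full E25K          */
    const int chips_per_board   = 4;   /* CPU chips per CPU-memory board  */
    const int cpus_per_chip     = 2;   /* UltraSPARC IV = two processors  */
    const int modules_per_board = 4;   /* RAM modules per CPU-memory board */
    const int gb_per_module     = 8;
    const int pci_per_ioboard   = 4;   /* PCI slots per I/O board         */

    printf("CPUs:      %d\n", boardsets * chips_per_board * cpus_per_chip);    /* 144 */
    printf("RAM (GB):  %d\n", boardsets * modules_per_board * gb_per_module);  /* 576 */
    printf("PCI slots: %d\n", boardsets * pci_per_ioboard);                    /* 72  */
    return 0;
}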
 