Cache Coherent NUMA Multiprocessors
Multiprocessor designs such as that of Fig. 8-32 do not scale well because they
do not do caching. Having to go to the remote memory every time a nonlocal
memory word is accessed is a major performance hit. However, if caching is
added, then cache coherence must also be added. One way to provide cache coher-
ence is to snoop on the system bus. Technically, doing this is not difficult, but
beyond a certain number of CPUs, it becomes infeasible. To build really large
multiprocessors, a fundamentally different approach is needed.
The most popular approach for building large CC-NUMA (Cache Coherent
NUMA) multiprocessors currently is the directory-based multiprocessor. The
idea is to maintain a database telling where each cache line is and what its status is.
When a cache line is referenced, the database is queried to find out where it is and
whether it is clean or dirty (modified). Since this database must be queried on
every single instruction that references memory, it must be kept in extremely fast
special-purpose hardware that can respond in a fraction of a bus cycle.
To make the idea of a directory-based multiprocessor somewhat more concrete,
let us consider a simple (hypothetical) example, a 256-node system, each node
consisting of one CPU and 16 MB of RAM connected to the CPU via a local bus.
The total memory is 2^32 bytes, divided up into 2^26 cache lines of 64 bytes each.
The memory is statically allocated among the nodes, with 0-16M in node 0,
16-32M in node 1, and so on. The nodes are connected by an interconnection net-
work, as shown in Fig. 8-33(a). This network could be a grid, hypercube, or other
topology. Each node also holds the directory entries for the 2^18 64-byte cache lines
comprising its 2^24-byte memory. For the moment, we will assume that a line can
be held in at most one cache.
To see how the directory works, let us trace a LOAD instruction from CPU 20
that references a cached line. First the CPU issuing the instruction presents it to its
MMU, which translates it to a physical address, say, 0x24000108. The MMU
splits this address into the three parts shown in Fig. 8-33(b). In decimal, the three
parts are node 36, line 4, and offset 8. The MMU sees that the memory word refer-
enced is from node 36, not node 20, so it sends a request message through the
interconnection network to the line's home node, 36, asking whether its line 4 is
cached, and if so, where.
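The field widths follow directly from the parameters above: 8 bits select one of
the 256 nodes, 18 bits select one of the node's 2^18 lines, and 6 bits select a
byte within a 64-byte line. A small C sketch of the split (the function name is
hypothetical) reproduces the example address:

    #include <stdio.h>
    #include <stdint.h>

    /* Split a 32-bit physical address into the three fields of
     * Fig. 8-33(b): 8-bit node, 18-bit line, 6-bit offset. */
    static void split(uint32_t addr,
                      unsigned *node, unsigned *line, unsigned *offset) {
        *node   = addr >> 24;            /* top 8 bits: home node */
        *line   = (addr >> 6) & 0x3FFFF; /* next 18 bits: cache line */
        *offset = addr & 0x3F;           /* low 6 bits: byte within line */
    }

    int main(void) {
        unsigned node, line, offset;
        split(0x24000108, &node, &line, &offset);
        printf("node %u, line %u, offset %u\n", node, line, offset);
        /* prints: node 36, line 4, offset 8 */
        return 0;
    }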
When the request arrives at node 36 over the interconnection network, it is
routed to the directory hardware. The hardware indexes into its table of 2^18 entries,
one for each of its cache lines, and extracts entry 4. From Fig. 8-33(c) we see that
the line is not cached, so the hardware fetches line 4 from the local RAM, sends it
back to node 20, and updates directory entry 4 to indicate that the line is now
cached at node 20.
Now let us consider a second request, this time asking about node 36's line 2.
From Fig. 8-33(c) we see that this line is cached at node 82. At this point the hard-
ware could update directory entry 2 to say that the line is now at node 20 and then
send a message to node 82 instructing it to pass the line to node 20 and invalidate
its own cache entry.
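Putting the two cases together, a rough C simulation of how the home node's
directory logic might service a read request looks as follows; the message-passing
functions are stubs, and all names are invented for illustration rather than taken
from any real design:

    #include <stdio.h>

    #define LINES_PER_NODE (1 << 18)

    typedef struct { unsigned valid : 1; unsigned holder : 8; } dir_entry;
    static dir_entry directory[LINES_PER_NODE];

    /* Stubs standing in for the local RAM and the interconnect. */
    static void send_line_from_ram(unsigned line, unsigned to) {
        printf("RAM line %u -> node %u\n", line, to);
    }
    static void forward_and_invalidate(unsigned line, unsigned from, unsigned to) {
        printf("ask node %u to pass line %u to node %u and invalidate\n",
               from, line, to);
    }

    /* Home node services a read request for one of its own lines. */
    static void handle_read(unsigned line, unsigned requester) {
        dir_entry *e = &directory[line];
        if (!e->valid) {
            /* Uncached: fetch the line from local RAM. */
            send_line_from_ram(line, requester);
        } else {
            /* Cached elsewhere: the current holder must hand the line
             * over, since a line may live in at most one cache. */
            forward_and_invalidate(line, e->holder, requester);
        }
        e->valid  = 1;           /* record the new holder */
        e->holder = requester;
    }

    int main(void) {
        directory[2] = (dir_entry){ .valid = 1, .holder = 82 };
        handle_read(4, 20);  /* uncached case: line 4 requested by node 20 */
        handle_read(2, 20);  /* cached case:   line 2 held by node 82 */
        return 0;
    }

Running this prints the two actions traced above: line 4 is served directly from
node 36's RAM, while line 2 must be handed over by node 82.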