Cache Coherent NUMA Multiprocessors
Multiprocessor designs such as that of Fig. 8-32 do not scale well because they
do not do caching. Having to go to the remote memory every time a nonlocal
memory word is accessed is a major performance hit. However, if caching is
added, then cache coherence must also be added. One way to provide cache coher-
ence is to snoop on the system bus. Technically, doing this is not difficult, but
beyond a certain number of CPUs, it becomes infeasible. To build really large
multiprocessors, a fundamentally different approach is needed.
The most popular approach for building large CC-NUMA (Cache Coherent
NUMA) multiprocessors currently is the directory-based multiprocessor. The
idea is to maintain a database telling where each cache line is and what its status is.
When a cache line is referenced, the database is queried to find out where it is and
whether it is clean or dirty (modified). Since this database must be queried on
every single instruction that references memory, it must be kept in extremely fast
special-purpose hardware that can respond in a fraction of a bus cycle.
To make the idea of a directory-based multiprocessor somewhat more concrete,
let us consider a simple (hypothetical) example, a 256-node system, each node
consisting of one CPU and 16 MB of RAM connected to the CPU via a local bus.
The total memory is 2^32 bytes, divided up into 2^26 cache lines of 64 bytes each.
The memory is statically allocated among the nodes, with 0-16M in node 0,
16-32M in node 1, and so on. The nodes are connected by an interconnection net-
work, as shown in Fig. 8-33(a). This network could be a grid, hypercube, or other
topology. Each node also holds the directory entries for the 2^18 64-byte cache lines
comprising its 2^24-byte memory. For the moment, we will assume that a line can
be held in at most one cache.
To see how the directory works, let us trace a LOAD instruction from CPU 20
that references a cached line. First the CPU issuing the instruction presents it to its
MMU, which translates it to a physical address, say, 0x24000108. The MMU
splits this address into the three parts shown in Fig. 8-33(b). In decimal, the three
parts are node 36, line 4, and offset 8. The MMU sees that the memory word refer-
enced is from node 36, not node 20, so it sends a request message through the
interconnection network to the line's home node, 36, asking whether its line 4 is
cached, and if so, where.
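The field widths follow directly from the parameters above: 8 bits select one of
the 256 nodes, 18 bits select one of the node's 2^18 lines, and 6 bits select a
byte within a 64-byte line. A small C sketch of the split (the function name is
hypothetical) reproduces the example address:

    #include <stdio.h>
    #include <stdint.h>

    /* Split a 32-bit physical address into the three fields of
     * Fig. 8-33(b): 8-bit node, 18-bit line, 6-bit offset. */
    static void split(uint32_t addr,
                      unsigned *node, unsigned *line, unsigned *offset) {
        *node   = addr >> 24;            /* top 8 bits: home node */
        *line   = (addr >> 6) & 0x3FFFF; /* next 18 bits: cache line */
        *offset = addr & 0x3F;           /* low 6 bits: byte within line */
    }

    int main(void) {
        unsigned node, line, offset;
        split(0x24000108, &node, &line, &offset);
        printf("node %u, line %u, offset %u\n", node, line, offset);
        /* prints: node 36, line 4, offset 8 */
        return 0;
    }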
When the request arrives at node 36 over the interconnection network, it is
routed to the directory hardware. The hardware indexes into its table of 2^18 entries,
one for each of its cache lines, and extracts entry 4. From Fig. 8-33(c) we see that
the line is not cached, so the hardware fetches line 4 from the local RAM, sends it
back to node 20, and updates directory entry 4 to indicate that the line is now
cached at node 20.
Now let us consider a second request, this time asking about node 36's line 2.
From Fig. 8-33(c) we see that this line is cached at node 82. At this point the hard-
ware could update directory entry 2 to say that the line is now at node 20 and then
send a message to node 82 instructing it to pass the line to node 20 and invalidate
its own cache entry.
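Putting the two cases together, a rough C simulation of how the home node's
directory logic might service a read request looks as follows; the message-passing
functions are stubs, and all names are invented for illustration rather than taken
from any real design:

    #include <stdio.h>

    #define LINES_PER_NODE (1 << 18)

    typedef struct { unsigned valid : 1; unsigned holder : 8; } dir_entry;
    static dir_entry directory[LINES_PER_NODE];

    /* Stubs standing in for the local RAM and the interconnect. */
    static void send_line_from_ram(unsigned line, unsigned to) {
        printf("RAM line %u -> node %u\n", line, to);
    }
    static void forward_and_invalidate(unsigned line, unsigned from, unsigned to) {
        printf("ask node %u to pass line %u to node %u and invalidate\n",
               from, line, to);
    }

    /* Home node services a read request for one of its own lines. */
    static void handle_read(unsigned line, unsigned requester) {
        dir_entry *e = &directory[line];
        if (!e->valid) {
            /* Uncached: fetch the line from local RAM. */
            send_line_from_ram(line, requester);
        } else {
            /* Cached elsewhere: the current holder must hand the line
             * over, since a line may live in at most one cache. */
            forward_and_invalidate(line, e->holder, requester);
        }
        e->valid  = 1;           /* record the new holder */
        e->holder = requester;
    }

    int main(void) {
        directory[2] = (dir_entry){ .valid = 1, .holder = 82 };
        handle_read(4, 20);  /* uncached case: line 4 requested by node 20 */
        handle_read(2, 20);  /* cached case:   line 2 held by node 82 */
        return 0;
    }

Running this prints the two actions traced above: line 4 is served directly from
node 36's RAM, while line 2 must be handed over by node 82.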