"$" stands for U.S. dollars (i.e., money). Another word for paper money is cash, which sounds
just like the way Americans pronounce cache. Aren't hardware designers funny?
The distinction between unified caches and divided caches (one section for instructions, a
different section for data) is not particularly interesting for what we're doing.
The store buffer is a small, specialized cache that holds words the CPU is writing out to memory.
The idea is that instead of requiring the CPU to stall while a write is going on (it takes 30-100
cycles), the word will be placed into the store buffer, which will then arrange to write the word out
to main memory when it sees fit. This way the CPU can run at full speed, not worrying about
exactly when a word arrives in main memory.
Of course, the store buffer must be closely coupled with the I$ and memory fetch unit to ensure
that the CPU has a coherent view of memory. It wouldn't do for CPU 0 to write 0x1234545F into
location 0x00000010, then load 0x00000010 and not see 0x1234545F. Hardware architects take
care of that, so we don't have to bother. The other issue with using a store buffer is that of
determining when writes arrive in main memory. CPU 0 might write out dozens of words, placing
them in the store buffer, while CPU 1, which then accesses those words, wouldn't see the changes,
because the store buffer hasn't written them out yet. This is problem 2.
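The delayed-visibility behavior can be sketched as a toy model in C. All the names here (store_buffer_t, cpu_store, and so on) are invented for illustration, not real hardware interfaces: CPU 0's pending writes sit in its store buffer, where CPU 0's own loads can find them, but main memory (which is all another CPU can see in this model) is not updated until the buffer drains.

```c
/* Toy model of a store buffer: pending writes queue here instead of
 * stalling the CPU, and drain to "main memory" when the buffer sees fit. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SB_SLOTS  8
#define MEM_WORDS 64

typedef struct { uint32_t addr; uint32_t value; } sb_entry_t;

typedef struct {
    sb_entry_t entry[SB_SLOTS];
    size_t     count;               /* pending, not yet in main memory */
} store_buffer_t;

static uint32_t memory[MEM_WORDS];  /* stands in for main memory */

/* CPU-side write: goes into the buffer; the CPU keeps running. */
static void cpu_store(store_buffer_t *sb, uint32_t addr, uint32_t value)
{
    assert(sb->count < SB_SLOTS);
    sb->entry[sb->count++] = (sb_entry_t){ addr, value };
}

/* CPU-side read: checks the buffer first (store-to-load forwarding), so
 * the CPU always sees its own most recent write. */
static uint32_t cpu_load(const store_buffer_t *sb, uint32_t addr)
{
    for (size_t i = sb->count; i-- > 0; )   /* newest entry wins */
        if (sb->entry[i].addr == addr)
            return sb->entry[i].value;
    return memory[addr];
}

/* The buffer writes its words out to memory "when it sees fit". */
static void sb_drain(store_buffer_t *sb)
{
    for (size_t i = 0; i < sb->count; i++)
        memory[sb->entry[i].addr] = sb->entry[i].value;
    sb->count = 0;
}
```

Until `sb_drain` runs, a load of the same address through the buffer returns the new value, while a direct read of `memory[]` (another CPU's view) still returns the old one: exactly problem 2.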
Just to further complicate the hardware picture, it is possible for the hardware designers to give the
store buffer more latitude in its choice of which words to write out when. Total store order refers
to a design that requires the store buffer to write words to main memory in the same order as the
instruction stream. It can be more efficient for the store buffer to write words out in a different
order (perhaps it can write a series of contiguous words out together; perhaps it can write a word
to memory bank 1, then memory bank 2). There are a variety of schemes for this out-of-order
writing (partial store order, weak order, etc.). The importance to us is that we must not rely on
write order! This is problem 3.
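In portable code, the answer to not being able to rely on write order is explicit synchronization. The sketch below (assuming C11 atomics and POSIX threads; the function names are invented) shows the standard publication idiom: a release store on the flag guarantees that the payload is visible before the flag is, no matter what order the store buffer would otherwise choose.

```c
/* Publishing data behind a flag: the release/acquire pair forces the
 * ordering that total store order gives for free but partial/weak
 * ordering does not. */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;                  /* ordinary data word */
static atomic_int ready = 0;         /* publication flag */

static void *writer(void *arg)
{
    (void)arg;
    payload = 42;                    /* store 1 */
    /* Release: all earlier stores become visible before the flag does. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *reader(void *arg)
{
    /* Acquire: once we see the flag, we are guaranteed to see payload. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                            /* spin until published */
    *(int *)arg = payload;
    return NULL;
}

static int publish_demo(void)
{
    int seen = 0;
    pthread_t w, r;
    pthread_create(&r, NULL, reader, &seen);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return seen;                     /* always 42, by the ordering above */
}
```

Without the release/acquire pair, a weakly ordered machine could drain the flag's store before the payload's, and the reader could return a stale value.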
One more complication is that CPUs might do out-of-order execution, too. If a CPU has to wait
for a cache fill before executing instruction 1, it is allowed to look at instruction 2. If there is no
dependency on 1, the CPU may proceed to execute 2 first. This is a wonderful thing for hardware
architects, as it gives them enormous leeway in their designs, allowing the CPU to run at
maximum possible speeds. It also complicates CPU design, ensuring full employment for
hardware designers. For us software types, it means that we cannot rely on order of execution.
Also problem 3.
There are some fancy algorithms, such as Dekker's algorithm, that avoid using mutexes by
depending upon the write order of CPUs. These techniques will not work on modern SMP machines.
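To make the point concrete, here is a sketch of Dekker's algorithm for two threads (identifiers like dekker_lock are invented for illustration). Written with plain loads and stores, it fails on modern SMP machines for exactly the reasons above; declaring the shared variables as C11 atomics, whose default ordering is sequentially consistent, restores the ordering the algorithm depends on, at the cost of a fence on every operation.

```c
/* Dekker's algorithm for two threads, using sequentially consistent
 * C11 atomics so the flag/turn ordering it relies on actually holds. */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int wants[2] = { 0, 0 };
static atomic_int turn = 0;
static int counter = 0;              /* protected, deliberately non-atomic */

static void dekker_lock(int self)
{
    int other = 1 - self;
    atomic_store(&wants[self], 1);           /* seq_cst by default */
    while (atomic_load(&wants[other])) {
        if (atomic_load(&turn) != self) {
            atomic_store(&wants[self], 0);   /* back off */
            while (atomic_load(&turn) != self)
                ;                            /* wait for our turn */
            atomic_store(&wants[self], 1);
        }
    }
}

static void dekker_unlock(int self)
{
    atomic_store(&turn, 1 - self);
    atomic_store(&wants[self], 0);
}

static void *worker(void *arg)
{
    int self = *(int *)arg;
    for (int i = 0; i < 10000; i++) {
        dekker_lock(self);
        counter++;                   /* critical section */
        dekker_unlock(self);
    }
    return NULL;
}
```

If the two atomic stores in `dekker_lock` were ordinary stores sitting in a store buffer, both threads could read the other's stale `wants` flag and enter the critical section together.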
Figure 16-1 shows a typical SMP system. Each CPU has its own on-chip "I$" (internal cache) and store buffer. It
also has a much larger, off-chip E$ (external cache). All external communication is done over a single memory bus.
Part of the memory bus protocol for all these machines is that each CPU will do bus snooping.
Every memory transaction will be observed by every bus snooper, and every time that CPU 0
writes a word out to main memory, every other bus snooper will see it and invalidate that entry
in its own caches (both "E$" and "I$"). The next time that CPU 1 wants to use that word, it will
look in its own cache, see that the entry has been marked invalid, and go out to main memory to
get the correct value.
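The invalidation protocol can be sketched in a few lines of C, with a one-line "cache" per CPU and write-through to memory. This is far simpler than real hardware, and every name in it is invented for illustration:

```c
/* Toy bus snooping: every write is broadcast on the "bus", so any other
 * cache holding that address marks its copy invalid instead of keeping
 * a stale value. */
#include <stdint.h>

#define NCPUS 2

typedef struct {
    uint32_t addr;
    uint32_t value;
    int      valid;
} cache_line_t;

static uint32_t     memory[64];
static cache_line_t cache[NCPUS];    /* one line per CPU, for simplicity */

/* Every cache watches the bus for other CPUs' writes. */
static void bus_snoop(int writer_cpu, uint32_t addr)
{
    for (int cpu = 0; cpu < NCPUS; cpu++)
        if (cpu != writer_cpu && cache[cpu].valid && cache[cpu].addr == addr)
            cache[cpu].valid = 0;    /* invalidate the stale copy */
}

static void cpu_write(int cpu, uint32_t addr, uint32_t value)
{
    memory[addr] = value;            /* write through to main memory */
    cache[cpu] = (cache_line_t){ addr, value, 1 };
    bus_snoop(cpu, addr);            /* every snooper sees the transaction */
}

static uint32_t cpu_read(int cpu, uint32_t addr)
{
    if (cache[cpu].valid && cache[cpu].addr == addr)
        return cache[cpu].value;     /* cache hit */
    /* Miss (or invalidated entry): go out to main memory. */
    cache[cpu] = (cache_line_t){ addr, memory[addr], 1 };
    return cache[cpu].value;
}
```

After CPU 0 writes a word, CPU 1's next read of that address misses (its line was invalidated by the snoop) and fetches the correct value from main memory, which is the behavior the text describes.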
There are other schemes for dealing with this problem, such as cache broadcast, which simply
sends out the updated value immediately, but this won't change our programming decisions.
Figure 16-1. SMP System Architecture