We really don't expect you to deal with these fine-grained optimizations. We don't. They involve a
lot of careful estimation and painstaking verification, and they have to be tailored to individual
machines. But this kind of thing is possible, it does yield impressive improvements for some
programs, and the truly high-performance obsessive types do it. (Dakota Scientific's numerical
libraries take all of these parameters into account and get impressive results.)
What if you had a large number of records about people--names, ages, salaries, addresses,
favorite programming languages, etc.? To calculate the average salary for these folks, you would
have to bring in the cache block with the first person's salary in it (along with seven other words),
add that to the total, then bring in the next person's salary, etc. Each cache miss would bring in
exactly one piece of useful data, and every salary would require a cache miss.
If you organized the data differently, placing all of the salaries into one array, all of the names in
another, etc., you would be able to make much better use of each cache load. Instead of one salary
being loaded with each miss, you'd get eight, significantly reducing cache wait times.
This is not something you'd do for a casual program. But when you have this kind of program design and data usage, and you are desperate for optimal performance, that is when you do this kind of optimization.
What is the minimum-size data item that you can write to memory? On most modern machines it's 8 bits (one byte). On some it's 32 bits, and on some machines it could even be 64!
Now, what would happen if you used one lock to protect the first bit in a word, and another lock to protect the second? It wouldn't work. Because the hardware cannot store anything smaller than that minimum-size item, every time you wrote out bit 1 you would rewrite bit 2 along with it. If someone else was updating bit 2 under its own lock at that moment, their update would be silently lost. Too bad.
Don't do that. Happily, it is easy to avoid word tearing and it would be a pretty odd program
indeed that actually violated this restriction.
A cache memory is divided up into cache lines (typically, eight words) which are loaded and
tracked as a unit. If one word in the line is required, all eight are loaded. If one word is written out
by another CPU, the entire line is invalidated. Cache lines are based on the idea that if one word is
accessed, it's very likely that the next word will be also. Normally, this works quite well and yields
excellent performance. Sometimes it can work against you.
If eight integers happened to be located contiguously at a line boundary, and if eight different
threads on eight different CPUs happened to use those (unshared) integers extensively, we could
run into a problem. CPU 0 would write its integer. This would, of course, cause that cache line to be invalidated on all the other CPUs. CPU 1 now wishes to read its own integer. Even though it actually has a valid copy of that integer in cache, the line has been marked invalid, so CPU 1 must reload the entire cache line. And when CPU 1 writes its integer, CPU 0 will have to reload the line in turn, etc., etc.
This is what is called false sharing. On an 8-way, 244-MHz UE4000, the program shown in Code
Example 16-3 runs in 100 s when the integers are adjacent (SEPARATION == 1), and in 10 s
when the integers are distant (SEPARATION == 16). It is an unlikely problem (though it can happen), and one that you wouldn't even look for unless you did some careful performance tuning and noticed extensive CPU stalls. Without specialized memory tools, the only way you could find a problem like this is by experiment: spread the suspect data apart and see whether the program speeds up.