We really don't expect you to deal with these fine-grained optimizations. We don't. They involve a
lot of careful estimation and painstaking verification, and they have to be tailored to individual
machines. But this kind of thing is possible, it does yield impressive improvements for some
programs, and the truly high-performance obsessive types do it. (Dakota Scientific's numerical
libraries take all of these parameters into account and get impressive results.)
What if you had a large number of records about people--names, ages, salaries, addresses,
favorite programming languages, etc.? To calculate the average salary for these folks, you would
have to bring in the cache block with the first person's salary in it (along with seven other words),
add that to the total, then bring in the next person's salary, etc. Each cache miss would bring in
exactly one piece of useful data, and every salary would require a cache miss.
If you organized the data differently, placing all of the salaries into one array, all of the names in
another, etc., you would be able to make much better use of each cache load. Instead of one salary
being loaded with each miss, you'd get eight, significantly reducing cache wait times.
This is not something you'd do for a casual program. But when you have this kind of program design and data usage, and you are desperate for optimal performance, that is when you do this kind of optimization.
What is the minimum-size data item that you can write to memory? On most modern machines it's 8 bits (one byte). On some it's 32 bits, and on some machines it could even be 64!
Now, what would happen if you used one lock to protect the first bit in a word, and another lock to protect the second? It wouldn't work. Because the hardware cannot store anything smaller than that minimum-size item, every time you wrote out bit 1 you would rewrite bit 2 along with it. If someone else was updating bit 2 under its own lock at that moment, their update would be silently lost. Too bad.
Don't do that. Happily, it is easy to avoid word tearing and it would be a pretty odd program
indeed that actually violated this restriction.
A cache memory is divided up into cache lines (typically, eight words) which are loaded and
tracked as a unit. If one word in the line is required, all eight are loaded. If one word is written out
by another CPU, the entire line is invalidated. Cache lines are based on the idea that if one word is
accessed, it's very likely that the next word will be also. Normally, this works quite well and yields
excellent performance. Sometimes it can work against you.
If eight integers happened to be located contiguously at a line boundary, and if eight different
threads on eight different CPUs happened to use those (unshared) integers extensively, we could
run into a problem. CPU 0 would write its integer. This would, of course, cause that cache line to be invalidated on all the other CPUs. CPU 1 now wishes to read its own integer. Even though it actually has a valid copy of that integer in cache, the line has been marked invalid, so CPU 1 must reload the entire cache line. And when CPU 1 writes its integer, CPU 0 will have to reload the line in turn, etc., etc.
This is what is called false sharing. On an 8-way, 244-MHz UE4000, the program shown in Code
Example 16-3 runs in 100 s when the integers are adjacent (SEPARATION == 1), and in 10 s
when the integers are distant (SEPARATION == 16). It is an unlikely problem (though it can happen), and one that you wouldn't even look for unless you did some careful performance tuning and noticed extensive CPU stalls. Without specialized memory tools, the only way you could find a problem like this is by experiment: spread the suspect data apart and see whether the program speeds up.