Dekker's algorithm is basically a clever way of building locks in software for a known number of threads,
based on the assumption that writes to main memory arrive in order. Without a volatile declaration,
it cannot be implemented correctly on an out-of-order machine.
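For two threads, the algorithm can be sketched in Java, where volatile fields supply the in-order write visibility it assumes. The class and field names here are illustrative:

```java
// A two-thread sketch of Dekker's algorithm. Without volatile, the
// compiler or an out-of-order machine could reorder these stores and
// break the mutual exclusion.
class Dekker {
    private volatile boolean wants0 = false;  // thread 0 wants the lock
    private volatile boolean wants1 = false;  // thread 1 wants the lock
    private volatile int turn = 0;            // whose turn it is to proceed

    void lock(int id) {                       // id is 0 or 1
        int other = 1 - id;
        setWants(id, true);
        while (getWants(other)) {
            if (turn != id) {
                setWants(id, false);          // back off and wait our turn
                while (turn != id) { /* busy-wait */ }
                setWants(id, true);
            }
        }
    }

    void unlock(int id) {
        turn = 1 - id;                        // hand the turn to the other thread
        setWants(id, false);
    }

    private void setWants(int id, boolean v) {
        if (id == 0) wants0 = v; else wants1 = v;
    }

    private boolean getWants(int id) {
        return id == 0 ? wants0 : wants1;
    }
}
```

Note how much busy-waiting is involved; this is a demonstration of the principle, not a replacement for the locks your library already provides.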
The definition of volatile does imply that 64-bit data (e.g., doubles and longs) will be treated
atomically. On a machine that does 64-bit writes (UltraSPARC, Alpha, etc.), this is
straightforward to implement. On a machine that only does 32-bit writes (SPARC v7, x86, etc.),
this is a bit more difficult. To meet the definition of volatile, a 32-bit machine must actually
maintain a lock specifically for 64-bit volatile data.
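In Java source, obtaining this guarantee is simply a matter of the declaration; a minimal sketch (the class name is illustrative):

```java
// Declaring a long volatile obliges the JVM to read and write all 64
// bits atomically, even on a 32-bit machine where a plain long access
// may be split into two 32-bit halves.
class Sampler {
    private volatile long timestamp;   // never seen half-written by readers

    void update(long t) { timestamp = t; }
    long read()         { return timestamp; }
}
```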
It is unlikely that you will ever use volatile at all. Be careful!
In the second edition of his book, Doug Lea (see Appendix B) is including a simple example of using
volatile to manage a list. The code is amazingly complex for such a simple problem. I figure that if
it's that difficult for Doug to get right, I don't want to be doing it at all!
Atomic Reads and Writes
Most writes on most systems are either 32 or 64 bits wide, aligned on the appropriate word
boundary. Such writes are atomic--you never have to be concerned that the first 16 bits have been
written before the second 16 bits. What good does this do you? Very little. You cannot test a value
because it may change. You cannot increment a value because someone else might be using it at
the same time. You may be able to figure out a tricky way of combining this fact with a
volatile declaration to avoid using a lock, but it probably won't make your program go any
faster.
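One narrow case where that combination genuinely is safe is a one-shot flag with a single writer, such as a shutdown flag. A sketch (the class is illustrative; anything fancier than a single flag should use a lock):

```java
// A single atomic, volatile write replaces a lock here only because
// there is exactly one flag, written in one direction, by one thread.
class Worker implements Runnable {
    private volatile boolean done = false;   // set once, polled by run()

    public void run() {
        while (!done) {
            // do one unit of work per iteration
        }
    }

    void shutdown() { done = true; }         // no lock needed for a lone flag
}
```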
In Win32 there is a set of interlocked functions [InterlockedIncrement(),
InterlockedDecrement(), and InterlockedExchange()] which invoke the corresponding
interlocked instructions on x86 machines. These instructions provide functionality equivalent to
the fancy synchronization instructions we just looked at. Although some Win32 programmers will
gladly attest to their glory, they don't actually provide the programmer with much value, and they
cannot be called from Java. Use locks!
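Following that advice, the effect of InterlockedIncrement() is had in Java with an ordinary lock; a sketch (the Counter class is illustrative):

```java
// An atomic counter built the recommended way: the whole
// read-modify-write happens while holding the object's lock.
class Counter {
    private int value = 0;

    synchronized int increment() { return ++value; }  // atomic under the lock
    synchronized int get()       { return value; }
}
```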
The memory system in modern SMP machines is designed to be ignored. You shouldn't have to
spend any time thinking about it--it should just work. And it succeeds in this, to a degree. As long
as you are writing a program that is reasonably well behaved and that doesn't have overwhelming
needs for absolute maximum performance, you can skip over this section. Probably 95% of all
programs fit into this category. As for the other 5%...
In 1980, memory speeds were about the same as CPU speeds, and a machine could access main
memory in a single cycle. Since then, DRAM speeds have improved by an order of magnitude and
CPU speeds by almost four orders of magnitude. Direct main memory access now costs between 30 and 100 CPU
cycles. It is not at all unusual for a CPU to spend over half its time stalled, waiting for memory.
To the degree that you can reduce the number of main memory accesses (i.e., cache misses), you
will be handsomely paid in program performance. (N.B.: There is nothing unique to MP machines
or MT programs here.)
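As an illustration of that payoff (again, nothing MP-specific), consider summing a two-dimensional array; the method names here are ours:

```java
// Row order touches addresses sequentially, so each cache line fetched
// from main memory is fully used before eviction. Column order strides
// across rows and, on large arrays, misses in cache far more often,
// even though it computes the same result.
class Traversal {
    static long sumRowMajor(int[][] a) {
        long sum = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[i].length; j++)
                sum += a[i][j];          // consecutive addresses
        return sum;
    }

    static long sumColumnMajor(int[][] a) {
        long sum = 0;
        for (int j = 0; j < a[0].length; j++)
            for (int i = 0; i < a.length; i++)
                sum += a[i][j];          // strided addresses, same total
        return sum;
    }
}
```

On a large enough array, the row-order version can be several times faster purely from cache behavior, with no change in the arithmetic performed.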
Reducing Cache Misses