Organize your data so that when you do load a cache line, you make maximum use of it.
Any Other Loop Optimizations
There are all sorts of things you might be able to do to assist the compiler in performing
optimizations that it can't do by itself for some reason: inlining functions, loop unrolling, loop
interchange, loop fusion, etc. Generally, these things are done by the optimizer. We will look at
the assembly code for very tight loops just to verify our expectations. Your vendor documentation
will help here.
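As a concrete illustration of both points above (making full use of each cache line, and loop interchange), here is a minimal sketch. The class and method names are invented for this example; Java stores a `double[N][N]` as rows of contiguous doubles, so the row-major loop consumes each loaded cache line fully, while the interchanged loop strides across memory and wastes most of every line it loads.

```java
// Sketch (invented example): loop interchange and cache-line usage.
// Both methods compute the same sum; only the memory-access order differs.
public class LoopInterchange {
    static final int N = 1024;

    static double sumRowMajor(double[][] a) {
        double sum = 0.0;
        for (int i = 0; i < N; i++)        // outer loop over rows
            for (int j = 0; j < N; j++)    // inner loop walks contiguous memory
                sum += a[i][j];
        return sum;
    }

    static double sumColumnMajor(double[][] a) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)        // interchanged loops: strided access,
            for (int i = 0; i < N; i++)    // one cache line fetched per element
                sum += a[i][j];
        return sum;
    }

    public static void main(String[] args) {
        double[][] a = new double[N][N];
        for (double[] row : a)
            java.util.Arrays.fill(row, 1.0);
        System.out.println(sumRowMajor(a));    // same answer either way,
        System.out.println(sumColumnMajor(a)); // very different cache behavior
    }
}
```

A good optimizer may perform this interchange for you; inspecting the generated code, as suggested above, is how you verify that it did.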
Thread-Specific Performance Optimizations
Now that we have wildly emphasized the importance of doing all the normal performance work
first, let's take a look at the stuff that's specific to multithreaded programs. There are just a few
performance areas specific to MT: reducing contention, minimizing overhead, and creating the
right number of threads.
Reducing Contention
Clearly, we do not want to have lots of CPUs sitting idle because they cannot get a mutex they
need. Equally obviously, we cannot simply omit the locking in order to avoid this contention. Your
options for dealing with this situation are limited by what your program is doing.
In some circumstances, you will be able to divide your global data into smaller groups, with more
locks. Then a thread that needs to use item 1 will not block other threads that need item 2. This
will work only if the two items are not used together all the time. This is fine-grained locking.
There is a trade-off between grain size and overhead. Other times, you'll be able to substitute
readers/writer locks for mutexes.
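The readers/writer substitution can be sketched as follows. This is a minimal example of my own (the class and the table it protects are invented, not from the text): any number of threads may hold the read lock at once, so lookups no longer contend with each other, while updates still get exclusive access.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch (invented example): a readers/writer lock in place of a mutex.
// A win when reads greatly outnumber writes on the shared data.
public class SharedTable {
    private final ReentrantReadWriteLock rwlock = new ReentrantReadWriteLock();
    private final Map<String, Integer> table = new HashMap<>();

    public Integer lookup(String key) {
        rwlock.readLock().lock();      // many threads may hold the read lock at once
        try {
            return table.get(key);
        } finally {
            rwlock.readLock().unlock();
        }
    }

    public void update(String key, int value) {
        rwlock.writeLock().lock();     // a writer waits for all readers to drain
        try {
            table.put(key, value);
        } finally {
            rwlock.writeLock().unlock();
        }
    }
}
```

Note that this only pays off if the read-to-write ratio is high; a readers/writer lock is more expensive than a plain mutex when writes are frequent.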
Minimizing MT Overhead
There are a few threads functions that you might call often enough to have a significant
impact upon performance. The first case is the fine-grained vs. coarse-grained locking trade-off.
In cases where different data items are used together, making the locking finer-grained will
increase the overhead due to locking, slowing the total performance even though contention may
be reduced. In the friends/enemies program (Manipulating Lists), it is possible for us to lock
every single list node individually. This would increase the parallelism of the program over the
global mutex design, but the total runtime would be many times worse.
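To make the overhead concrete, here is a sketch of per-node locking on a linked list (my own illustration, not the friends/enemies code itself). Every traversal step now pays a lock/unlock pair, which is exactly the cost the paragraph above warns about; a single global mutex would pay that cost once per search.

```java
import java.util.concurrent.locks.ReentrantLock;

// Sketch (invented example): hand-over-hand ("lock coupling") traversal
// of a list with one lock per node. Correct, highly parallel, and slow:
// each step acquires and releases a lock.
public class FineGrainedList {
    static class Node {
        final ReentrantLock lock = new ReentrantLock();
        final int value;
        Node next;
        Node(int value) { this.value = value; }
    }

    private final Node head = new Node(Integer.MIN_VALUE); // dummy head node

    public void addFront(int value) {
        head.lock.lock();
        try {
            Node n = new Node(value);
            n.next = head.next;
            head.next = n;
        } finally {
            head.lock.unlock();
        }
    }

    // Hold the current node's lock while acquiring the next node's lock,
    // then release the current one. (Exception handling omitted for brevity.)
    public boolean contains(int target) {
        Node prev = head;
        prev.lock.lock();
        Node curr = prev.next;
        while (curr != null) {
            curr.lock.lock();          // lock the next node before letting go
            prev.lock.unlock();        // of the one we hold
            if (curr.value == target) {
                curr.lock.unlock();
                return true;
            }
            prev = curr;
            curr = curr.next;          // safe: we hold prev's (== curr's) lock
        }
        prev.lock.unlock();
        return false;
    }
}
```

Compare the two designs: a search of an n-node list takes one lock operation under a global mutex and roughly n under this scheme, which is why the parallelism gain can be swamped.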
What is the right granularity? It will be obvious in most cases, but sometimes the only solution is
experimentation.
Overlapping I/O and Computation
In most operating systems, overlapping I/O and computation can be accomplished without threads.
Most operating systems have some sort of asynchronous I/O that allows you to issue an I/O
request, then go back to what you were doing without waiting for it to complete. When it does
complete, a signal will be sent to your process and you will then ask the operating system which
request it was that completed and deal with it as you please. (Obviously, this is not a direct issue
for Java, which has nothing similar.)
This asynchronous I/O can be awkward to deal with, but it will do the job. Using threads instead
of asynchronous I/O is much easier to program and equally fast (Figure 15-7). The one place
where async I/O will not work is with page faults. When a nonthreaded program takes a page fault,
the entire process must wait for the page to be brought in; in a threaded program, only the faulting
thread waits, while the others continue to run.
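The threads-instead-of-async-I/O idea can be sketched as follows (the class, method, and the `ByteArrayInputStream` standing in for a slow device are my own invention). The blocking read is handed to its own thread, which sleeps in the kernel while the calling thread keeps computing; no signal handler or completion query is needed.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch (invented example): overlapping I/O and computation with a thread.
public class OverlapIO {
    static String readWhileComputing(InputStream in) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            // Issue the "I/O request": the pool thread blocks in read(), not us.
            Future<byte[]> request = pool.submit(() -> in.readAllBytes());

            long sum = 0;                 // overlap: compute while the read is in flight
            for (int i = 0; i < 1_000_000; i++) sum += i;

            byte[] data = request.get();  // rendezvous: block only when we need the data
            return new String(data) + " / sum=" + sum;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a slow device: any blocking InputStream behaves the same way.
        System.out.println(readWhileComputing(
                new ByteArrayInputStream("disk block".getBytes())));
    }
}
```

The `Future.get()` call plays the role that the completion signal plays in the async-I/O scheme, but the control flow stays straight-line, which is the programmability advantage claimed above.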