Any Other Loop Optimizations - Thread-Specific Performance Optimizations - Reducing Contention - Minimizing MT Overhead - Reducing Paging - Multithreaded Programming with JAVA

Organize your data so that when you do load a cache line, you make maximum use of it and you

don't have to load it again (see Reducing Cache Misses).

Any Other Loop Optimizations

There are all sorts of things you might be able to do to assist the compiler in performing

optimizations that it can't do by itself for some reason: inlining functions, loop unrolling, loop

interchange, loop fusion, etc. Generally, these things are done by the optimizer. We will look at

the assembly code for very tight loops just to verify our expectations. Your vendor documentation

will help here.

Thread-Specific Performance Optimizations

Now that we have wildly emphasized the importance of doing all the normal performance work

first, let's take a look at the stuff that's specific to multithreaded programs. There are just a couple

of performance areas specific to MT: reducing contention, minimizing overhead, and creating the

right number of threads.

Reducing Contention

Clearly, we do not want to have lots of CPUs waiting around idle because they can't get a mutex

they need. Equally obviously, we cannot neglect proper locking to avoid this contention. Your

options for dealing with this situation are limited by what you're doing.

In some circumstances, you will be able to divide your global data into smaller groups, with more

locks. Then a thread that needs to use item 1 will not block other threads that need item 2. This

will work only if the two items are not used together all the time. This is fine-grained locking.

There is a trade-off between grain size and overhead. Other times, you'll be able to substitute

readers/writer locks for mutexes.

Minimizing MT Overhead

There are a few different threads functions that you might call often enough to make a significant

impact upon performance. The first case is the fine-grained vs. course-grained locking trade-off.

In cases where different data items are used together, making the locking finer-grained will

increase the overhead due to locking, slowing the total performance even though contention may

be reduced. In the friends/ enemies program (Manipulating Lists), it is possible for us to lock

every single list node individually. This will increase the parallelism of the program over the

global mutex design, but total runtime will be many times worse.

What is the right granularity? It will be obvious in most cases, but sometimes the only solution is

to experiment.

Reducing Paging

In most operating systems, overlapping I/O and computation can be accomplished without threads.

Most operating systems have some sort of asynchronous I/O that allows you to issue an I/O

request, then go back to what you were doing without waiting for it to complete. When it does

complete, a signal will be sent to your process and you will then ask the operating system which

request it was that completed and deal with it as you please. (Obviously, this is not a direct issue

for Java, which has nothing similar.)

This asynchronous I/O can be awkward to deal with, but it will do the job. Using threads instead

of asynchronous I/O is much easier to program and equally fast (Figure 15-7). The one place

where async I/O will not work is with page faults. When a nonthreaded program takes a page fault,

Search WWH :

Custom Search