Multithreaded Programming with JAVA

One basic fact should be acknowledged up front: There is always a limit. For every program or
system load that you can imagine, there is an optimal number of CPUs to run it on. Adding more
CPUs beyond that number will slow it down.

You could, if you wanted, build a 1-million-CPU SMP machine. It just wouldn't be very efficient.
And while we can invent programs that would make good use of all 1 million CPUs (e.g., analyze
all 20-move chess games), they would be highly contrived. Most "normal" programs can make use
of only a small number of CPUs (typically, 2 to 20).

Let's start by looking at data from some simple programs (Figure 15-2). These are
numerically intensive programs that run entirely in memory. Because there is no I/O involved, and
because the amount of shared data is often quite limited, all of these programs show superb
scaling up to 16 CPUs.

Figure 15-2. Parallel Speedup on Several Numerical Programs
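
To collect speedup numbers of this kind for a program of your own, a small timing harness is
enough. The following is a minimal sketch, not the code behind Figure 15-2: the class name
SpeedupProbe, the square-root summing kernel, and the problem size are all hypothetical
stand-ins for a numerically intensive, in-memory workload with no I/O and no shared writes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class SpeedupProbe {
    // Hypothetical CPU-bound kernel: sums sqrt(i) over [from, to).
    // It touches no shared data, so threads never wait on each other.
    static double kernel(long from, long to) {
        double sum = 0.0;
        for (long i = from; i < to; i++) {
            sum += Math.sqrt((double) i);
        }
        return sum;
    }

    // Splits [0, n) into one contiguous slice per thread, runs the slices
    // in a fixed-size pool, and returns the combined sum.
    static double runParallel(long n, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Double>> parts = new ArrayList<>();
            long chunk = n / threads;
            for (int t = 0; t < threads; t++) {
                final long from = t * chunk;
                // The last slice absorbs any remainder.
                final long to = (t == threads - 1) ? n : from + chunk;
                Callable<Double> task = () -> kernel(from, to);
                parts.add(pool.submit(task));
            }
            double total = 0.0;
            for (Future<Double> part : parts) {
                total += part.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    // Wall-clock time in milliseconds for one run with the given thread count.
    static double timeMillis(long n, int threads) throws Exception {
        long start = System.nanoTime();
        runParallel(n, threads);
        return (System.nanoTime() - start) / 1e6;
    }

    public static void main(String[] args) throws Exception {
        long n = 50_000_000L; // assumed problem size; adjust for your machine
        int maxThreads = Runtime.getRuntime().availableProcessors();
        timeMillis(n, 1); // warm-up run so the JIT has compiled the kernel
        double base = timeMillis(n, 1);
        for (int t = 1; t <= maxThreads; t *= 2) {
            double ms = timeMillis(n, t);
            System.out.printf("threads=%d time=%.1f ms speedup=%.2f%n", t, ms, base / ms);
        }
    }
}
```

Speedup here is the single-thread time divided by the time with t threads. The numbers vary
from run to run and from machine to machine, so treat a single run as indicative only.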

The fast Fourier transform is performed by a set of matrix manipulations. It is characterized by
largely independent operations, with significant interthread communication in only one section.
The next three programs all have largely constant amounts of interthread communication. LU
factorization is dense matrix factorization and is also performed by a set of matrix manipulations.
Barnes-Hut is an N-body simulation for solving a problem in galaxy evolution. Ocean simulates
the effects of certain currents on large-scale flow in the ocean.

Notice that all of these programs do show a falloff in performance for each additional CPU. At
some point, the gain from adding a CPU will drop below zero, and each further CPU will begin to
reduce the total throughput. Why? Well, let's take a look at where these programs are spending
their time. As you can see from Figure 15-3, the fraction of time that the CPUs actually spend
working on the problem drops as the number of CPUs increases. Notice that memory overhead can
easily occupy 50% of total CPU time. On database-style programs, it can exceed 50%. The
requirement for synchronization takes up more and more of the time. Extrapolating out to just
128 CPUs, we can infer that performance would be dismal indeed.

Figure 15-3. Program Behavior for Parallelized Benchmarks
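
The shape of this argument can be reproduced with a toy model. The sketch below is an
illustration only and is not fitted to the data in either figure: it assumes run time on n CPUs
is a serial part, plus a parallel part divided n ways, plus an overhead (memory and
synchronization) that grows linearly with each added CPU. The class name ScalingModel, the
linear overhead term, and the constants in main are all assumptions chosen for illustration.

```java
class ScalingModel {
    // Modeled run time on n CPUs. The linear overhead term is an assumption;
    // real memory and synchronization costs need not grow this way.
    static double time(int n, double serial, double parallel, double overheadPerCpu) {
        return serial + parallel / n + overheadPerCpu * (n - 1);
    }

    // Modeled speedup relative to a single CPU.
    static double speedup(int n, double serial, double parallel, double overheadPerCpu) {
        return time(1, serial, parallel, overheadPerCpu)
             / time(n, serial, parallel, overheadPerCpu);
    }

    // CPU count in [1, maxCpus] with the smallest modeled run time.
    static int bestCpuCount(int maxCpus, double serial, double parallel, double overheadPerCpu) {
        int best = 1;
        for (int n = 2; n <= maxCpus; n++) {
            if (time(n, serial, parallel, overheadPerCpu)
                    < time(best, serial, parallel, overheadPerCpu)) {
                best = n;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Hypothetical constants: fully parallel work of 1.0 time unit,
        // plus 0.001 time units of overhead for every CPU beyond the first.
        double serial = 0.0, parallel = 1.0, overhead = 0.001;
        for (int n = 1; n <= 128; n *= 2) {
            System.out.printf("cpus=%d speedup=%.2f%n", n, speedup(n, serial, parallel, overhead));
        }
        System.out.println("best cpu count: " + bestCpuCount(128, serial, parallel, overhead));
    }
}
```

With these assumed constants, the modeled run time is smallest at 32 CPUs, and the modeled
speedup at 128 CPUs is lower than at 16. Different constants move the peak, but any overhead
that grows with the CPU count eventually outweighs the shrinking share of work per CPU.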
