One basic fact should be acknowledged up front: there is always a limit. For every program or
system load that you can imagine, there is an optimal number of CPUs to run it on. Beyond that
point, adding more CPUs to the machine will actually slow it down.
You could, if you wanted, build a 1-million-CPU SMP machine. It just wouldn't be very efficient.
And while we can invent programs that would make good use of all 1 million CPUs (e.g., analyzing
all 20-move chess games), they would be highly contrived. Most "normal" programs can make use
of only a small number of CPUs (typically, 2–20).
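The existence of an optimal CPU count can be made concrete with a toy cost model (an illustrative assumption, not data from the text): total run time is a fixed serial portion, plus a parallel portion that shrinks with the CPU count, plus a per-CPU coordination cost that grows with it.

```python
def run_time(n, serial=1.0, parallel=100.0, overhead=0.05):
    """Toy model: serial work + parallel work split across n CPUs
    + a coordination cost that grows linearly with n.
    The parameter values are hypothetical."""
    return serial + parallel / n + overhead * n

# Past the optimum, each added CPU costs more in coordination
# than it saves in computation.
best = min(range(1, 1025), key=run_time)
print(best)                              # → 45
print(run_time(best) < run_time(1024))   # → True
```

Under these made-up parameters the minimum falls at 45 CPUs; a 1024-CPU run would be far slower, which is the sense in which a 1-million-CPU machine "wouldn't be very efficient."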
Let's start by looking at some data from some simple programs (Figure 15-2). These are
numerically intensive programs that run entirely in memory. Because there is no I/O involved, and
because the amount of shared data is often quite limited, all of these programs show a superb
scaling up to 16 CPUs.
Figure 15-2. Parallel Speedup on Several Numerical Programs
The fast Fourier transform (FFT) is performed by a set of matrix manipulations. It is characterized
by largely independent operations, with significant interthread communication in only one section.
The next three programs all have largely constant amounts of interthread communication. LU
factorization is a dense matrix factorization, also performed by a set of matrix manipulations.
Barnes-Hut is an N-body simulation for solving a problem in galaxy evolution. Ocean simulates
the effects of certain currents on large-scale flow in the ocean.
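The falloff that Figure 15-2 plots can be summarized by two simple ratios: speedup (single-CPU time over n-CPU time) and efficiency (speedup per CPU). The timing numbers below are hypothetical, chosen only to show the typical shape of the curve.

```python
def speedup(t1, tn):
    """Ratio of single-CPU run time to n-CPU run time."""
    return t1 / tn

def efficiency(t1, tn, n):
    """Speedup per CPU: 1.0 means perfect scaling."""
    return speedup(t1, tn) / n

# Hypothetical timings (seconds) for an in-memory numeric kernel.
times = {1: 64.0, 2: 33.0, 4: 17.5, 8: 9.8, 16: 6.1}
for n, tn in times.items():
    print(n, round(speedup(64.0, tn), 2), round(efficiency(64.0, tn, n), 2))
```

Even for a well-scaling kernel, efficiency drifts downward as CPUs are added, which is exactly the gradual falloff the next paragraph discusses.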
Notice that all of these programs show a falloff in the performance gained from each additional
CPU. At some point, that marginal gain drops below zero, and adding CPUs begins to reduce total
throughput. Why? Well, let's take a look at where these programs are spending their time. As you
can see from Figure 15-3, the amount of time that the CPUs actually spend working on the problem
drops as the number of CPUs increases. Notice that memory overhead can easily occupy 50% of
total CPU time; on database-style programs, it can exceed 50%. The requirement for
synchronization takes up more and more of the time. Extrapolating out to just 128 CPUs, we can
infer that performance would be poor indeed.
Figure 15-3. Program Behavior for Parallelized Benchmarks
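The trend in Figure 15-3 can be sketched with another assumed model (not the book's measured data): if synchronization and memory overhead consume a roughly fixed additional slice of each CPU's time for every CPU added, the fraction of time spent on useful work shrinks steadily, and an extrapolation to 128 CPUs leaves little of it.

```python
def useful_fraction(n, overhead_per_cpu=0.006):
    """Fraction of CPU time spent on real work, assuming each added CPU
    costs every CPU a fixed slice of overhead. Parameters are illustrative."""
    return max(1.0 - overhead_per_cpu * n, 0.0)

for n in (4, 16, 64, 128):
    print(n, round(useful_fraction(n), 2))
```

Under this model most of the machine's time at 128 CPUs goes to synchronization and memory traffic rather than computation, matching the inference drawn from the figure.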