If each CPU is capable of outputting 1 MB/sec, it does little good that the interconnect has a bisection bandwidth of 100 GB/sec. Communication will be limited by how much data each CPU can output.
In practice, actually achieving anything even close to the theoretical bandwidth is very difficult. Many sources of overhead work to reduce the capacity. For example, there is always some per-packet overhead associated with each packet: assembling it, building its header, and getting it going. Sending 1024 4-byte packets will never achieve the same bandwidth as sending one 4096-byte packet. Unfortunately, for achieving low latencies, using small packets is better, since large ones block the lines and switches too long. Thus there is an inherent conflict between achieving low average latencies and high bandwidth utilization. For some applications, one is more important than the other, and for other applications it is the other way around. It is worth noting, however, that you can always buy more bandwidth (by putting in more or wider wires), but you cannot buy lower latencies. Thus it is generally better to err on the side of making latencies as short as possible, and worry about bandwidth later.
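The per-packet overhead argument can be made concrete with a small model. The overhead and link-bandwidth figures below are illustrative assumptions, not measurements; the point is only the shape of the result:

```python
# Illustrative assumptions, not measured values.
OVERHEAD_S = 1e-6          # fixed cost per packet: 1 microsecond
LINK_BW = 100e6            # raw link bandwidth: 100 MB/sec

def effective_bandwidth(total_bytes, packet_size):
    """Bytes/sec actually delivered when the payload is split into packets."""
    packets = total_bytes / packet_size
    time_s = packets * (OVERHEAD_S + packet_size / LINK_BW)
    return total_bytes / time_s

small = effective_bandwidth(4096, 4)      # 1024 four-byte packets
large = effective_bandwidth(4096, 4096)   # one 4096-byte packet
```

With these numbers, the 1024 small packets pay the fixed per-packet cost 1024 times and deliver only a few percent of the raw link bandwidth, while the single large packet comes close to it.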
Software Metrics
Hardware metrics like latency and bandwidth look at what the hardware is capable of doing. However, users have a different perspective. They want to know how much faster their programs are going to run on a parallel computer than on a uniprocessor. For them, the key metric is speed-up: how much faster a program runs on an n-processor system than on a one-processor system. Typically these results are shown in graphs like those of Fig. 8-49. Here we see several different parallel programs run on a multicomputer consisting of 64 Pentium Pro CPUs. Each curve shows the speed-up of one program with k CPUs as a function of k. Perfect speed-up is indicated by the dotted line, in which using k CPUs makes the program go k times faster, for any k. Few programs achieve perfect speed-up, but some come close. The N-body problem parallelizes extremely well; awari (an African board game) does reasonably well; but inverting a certain skyline matrix does not go more than five times faster no matter how many CPUs are available. The programs and results are discussed in Bal et al. (1998).
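The speed-up metric itself is just a ratio of run times. As a sketch (the timings below are made-up illustrative numbers, not the measurements behind Fig. 8-49):

```python
def speedup(t_one_cpu, t_k_cpus):
    """Speed-up on k CPUs: run time on one CPU divided by run time on k CPUs."""
    return t_one_cpu / t_k_cpus

# Made-up timings (seconds) for one program at k = 1, 2, 4, 8 CPUs:
timings = {1: 64.0, 2: 33.0, 4: 17.5, 8: 9.5}
curve = {k: speedup(timings[1], t) for k, t in timings.items()}
# Perfect speed-up would give curve[k] == k for every k;
# real programs fall somewhere below that dotted line.
```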
Part of the reason that perfect speed-up is nearly impossible to achieve is that almost all programs have some sequential component, often the initialization phase, reading in the data, or collecting the results. Having many CPUs does not help here. Suppose that a program runs for T sec on a uniprocessor, with a fraction f of this time being sequential code and a fraction (1 - f) being potentially parallelizable, as shown in Fig. 8-50(a). If the latter code can be run on n CPUs with no overhead, its execution time can be reduced from (1 - f)T to (1 - f)T/n at best, as shown in Fig. 8-50(b). This gives a total execution time for the sequential and parallel parts of fT + (1 - f)T/n. The speed-up is just the execution time of the original program, T, divided by this new execution time:

Speedup = T / (fT + (1 - f)T/n) = n / (1 + (n - 1)f)
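This result, known as Amdahl's law, reduces to Speedup = n / (1 + (n - 1)f) and is easy to check numerically; a minimal sketch:

```python
def amdahl_speedup(f, n):
    """Speed-up on n CPUs of a program whose fraction f is sequential."""
    return n / (1 + (n - 1) * f)
```

For f = 0 the formula gives perfect speed-up n, and for f = 1 it gives no speed-up at all; even a small sequential fraction caps the gain well below n once n is large.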
 