If each CPU is capable of outputting 1 MB/sec, it does little good that the interconnect has a bisection bandwidth of 100 GB/sec. Communication will be limited by how much data each CPU can output.
In practice, actually achieving anything even close to the theoretical bandwidth is very difficult. Many sources of overhead work to reduce the capacity. For example, there is always some per-packet overhead associated with each packet: assembling it, building its header, and getting it going. Sending 1024 4-byte packets will never achieve the same bandwidth as sending one 4096-byte packet. Unfortunately, for achieving low latencies, using small packets is better, since large ones block the lines and switches too long. Thus there is an inherent conflict between achieving low average latencies and high bandwidth utilization. For some applications, one is more important than the other, and for other applications it is the other way around. It is worth noting, however, that you can always buy more bandwidth (by putting in more or wider wires), but you cannot buy lower latencies. Thus it is generally better to err on the side of making latencies as short as possible, and worry about bandwidth later.
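The per-packet overhead argument can be made concrete with a small model. The overhead and link-bandwidth figures below are illustrative assumptions, not measurements; the point is only the shape of the result:

```python
# Illustrative assumptions, not measured values.
OVERHEAD_S = 1e-6          # fixed cost per packet: 1 microsecond
LINK_BW = 100e6            # raw link bandwidth: 100 MB/sec

def effective_bandwidth(total_bytes, packet_size):
    """Bytes/sec actually delivered when the payload is split into packets."""
    packets = total_bytes / packet_size
    time_s = packets * (OVERHEAD_S + packet_size / LINK_BW)
    return total_bytes / time_s

small = effective_bandwidth(4096, 4)      # 1024 four-byte packets
large = effective_bandwidth(4096, 4096)   # one 4096-byte packet
```

With these numbers, the 1024 small packets pay the fixed per-packet cost 1024 times and deliver only a few percent of the raw link bandwidth, while the single large packet comes close to it.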
Software Metrics
Hardware metrics like latency and bandwidth look at what the hardware is capable of doing. However, users have a different perspective. They want to know how much faster their programs are going to run on a parallel computer than on a uniprocessor. For them, the key metric is speed-up: how much faster a program runs on an n-processor system than on a one-processor system. Typically these results are shown in graphs like those of Fig. 8-49. Here we see several different parallel programs run on a multicomputer consisting of 64 Pentium Pro CPUs. Each curve shows the speed-up of one program with k CPUs as a function of k. Perfect speed-up is indicated by the dotted line, in which using k CPUs makes the program go k times faster, for any k. Few programs achieve perfect speed-up, but some come close. The N-body problem parallelizes extremely well; awari (an African board game) does reasonably well; but inverting a certain skyline matrix does not go more than five times faster no matter how many CPUs are available. The programs and results are discussed in Bal et al. (1998).
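The speed-up metric itself is just a ratio of run times. As a sketch (the timings below are made-up illustrative numbers, not the measurements behind Fig. 8-49):

```python
def speedup(t_one_cpu, t_k_cpus):
    """Speed-up on k CPUs: run time on one CPU divided by run time on k CPUs."""
    return t_one_cpu / t_k_cpus

# Made-up timings (seconds) for one program at k = 1, 2, 4, 8 CPUs:
timings = {1: 64.0, 2: 33.0, 4: 17.5, 8: 9.5}
curve = {k: speedup(timings[1], t) for k, t in timings.items()}
# Perfect speed-up would give curve[k] == k for every k;
# real programs fall somewhere below that dotted line.
```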
Part of the reason that perfect speed-up is nearly impossible to achieve is that almost all programs have some sequential component, often the initialization phase, reading in the data, or collecting the results. Having many CPUs does not help here. Suppose that a program runs for T sec on a uniprocessor, with a fraction f of this time being sequential code and a fraction (1 - f) being potentially parallelizable, as shown in Fig. 8-50(a). If the latter code can be run on n CPUs with no overhead, its execution time can be reduced from (1 - f)T to (1 - f)T/n at best, as shown in Fig. 8-50(b). This gives a total execution time for the sequential and parallel parts of fT + (1 - f)T/n. The speed-up is just the execution time of the original program, T, divided by this new execution time:

Speedup = T / (fT + (1 - f)T/n) = n / (1 + (n - 1)f)
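This result, known as Amdahl's law, reduces to Speedup = n / (1 + (n - 1)f) and is easy to check numerically; a minimal sketch:

```python
def amdahl_speedup(f, n):
    """Speed-up on n CPUs of a program whose fraction f is sequential."""
    return n / (1 + (n - 1) * f)
```

For f = 0 the formula gives perfect speed-up n, and for f = 1 it gives no speed-up at all; even a small sequential fraction caps the gain well below n once n is large.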
 