communication overhead. However, enabling communication-computation overlap
can be a challenging programming task. Second, it sometimes happens that when
many processors of a distributed-memory system are used, the subproblem per
processor suddenly achieves much better utilization of the local cache than
solving the entire problem serially on one CPU with its local cache. This
provides the potential for superlinear speedup, i.e., S(P) > P, for particular
problem sizes.
Example 10.4. Let us analyze the speedup of the parallel composite trapezoidal
rule (10.7)-(10.10). As we can see from (10.6), the serial algorithm involves
n + 1 function evaluations plus O(n) additions and multiplications. If we assume
that the function evaluations are much more expensive than the floating-point
operations, the computational cost of the serial algorithm is

    T(1, n) = (n + 1) t_f,

where t_f denotes the cost of one function evaluation. The computational cost
of the parallel algorithm is

    T(P, n) = ⌈(n − 1)/P⌉ t_f + C ⌈log₂ P⌉ + 2 t_f,    (10.20)
where the first term corresponds to the cost of computing s_p using (10.7), the
second term corresponds to the cost of a reduction operation needed to sum all
the s_p values, and the last term corresponds to the cost of computing f(a) and
f(b) in the end.
Therefore, the speedup will be of the following form:

    S(P, n) = T(1, n) / T(P, n)
            = (n + 1) t_f / (⌈(n − 1)/P⌉ t_f + C ⌈log₂ P⌉ + 2 t_f)
            = (n + 1) / (⌈(n − 1)/P⌉ + 2 + C̄ ⌈log₂ P⌉).    (10.21)
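As an illustration (not taken from the text), formula (10.21) can be evaluated
numerically. The values n = 10,000 and C̄ = 10 below are arbitrary assumed
constants, chosen only to show how the predicted speedup behaves as P grows.

```python
import math

def speedup(P, n, C_bar):
    """Predicted speedup S(P, n) from the model (10.21);
    C_bar = C/t_f is the reduction cost in units of one
    function evaluation."""
    denom = math.ceil((n - 1) / P) + 2 + C_bar * math.ceil(math.log2(P))
    return (n + 1) / denom

# Illustrative values only: n = 10_000, C_bar = 10.
for P in (1, 2, 4, 16, 64, 256):
    print(f"P = {P:3d}  S = {speedup(P, 10_000, 10.0):7.1f}")
```

Running the loop shows the speedup staying below P and the gap widening with
larger P, exactly the saturation effect discussed below.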
Of course, to compute the actual values of S(P, n), we need to know the value
of the constant C̄ = C/t_f. This can be achieved by measuring T(P, n) for a
number of different choices of P and n and estimating the value of C̄ by the
method of least squares. Even without knowing the exact value of C̄ in (10.21),
we can say that the parallel efficiency η(P) will move further away from 100%
as P increases. This is because the C̄ ⌈log₂ P⌉ term in the denominator of
(10.21) increases with P, meaning that the speedup will eventually saturate and
then decrease after P passes a threshold value that depends on the actual
values of n and C̄.
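The least-squares fit mentioned above can be sketched as follows. Since the
model T(P, n)/t_f = ⌈(n − 1)/P⌉ + 2 + C̄ ⌈log₂ P⌉ is linear in the single
unknown C̄, the fit has a closed-form solution. The timings below are synthetic
numbers generated from the model itself, used only to illustrate the procedure;
real data would come from wall-clock measurements.

```python
import math

def estimate_C_bar(measurements, n, t_f):
    """Least-squares estimate of C_bar in the cost model
    T(P, n)/t_f = ceil((n - 1)/P) + 2 + C_bar * ceil(log2(P)).
    measurements is a list of (P, T) pairs with P > 1."""
    num = den = 0.0
    for P, T in measurements:
        x = math.ceil(math.log2(P))                # regressor: ceil(log2 P)
        y = T / t_f - math.ceil((n - 1) / P) - 2   # measured minus known terms
        num += x * y
        den += x * x
    return num / den                               # argmin_C of sum (y - C*x)^2

# Synthetic timings generated from the model with C_bar = 10;
# in practice these would be measured values of T(P, n).
t_f, n, true_C = 1e-6, 10_000, 10.0
data = [(P, t_f * (math.ceil((n - 1) / P) + 2 + true_C * math.ceil(math.log2(P))))
        for P in (2, 4, 8, 16, 32)]
C_bar = estimate_C_bar(data, n, t_f)   # recovers approximately 10.0
```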
Example 10.5. Table 10.1 shows some speedup results measured on a small cluster
of PCs, where the interconnect is 1 Gbit Ethernet. The parallel program is
written using the message-passing interface (MPI [14, 26]) and implements the compos-