communication overhead. However, enabling communication-computation overlap
can be a challenging programming task. Second, it sometimes happens that by using
many processors on a distributed-memory system, the subproblem per processor
suddenly achieves much better utilization of the local cache, compared with solving
the entire problem serially using one CPU and its local cache. This provides the
potential of superlinear speedup, i.e., $S(P) > P$, for particular problem sizes.
Example 10.4. Let us analyze the speedup of the parallel composite trapezoidal rule
(10.7)-(10.10). As we can see from (10.6), the serial algorithm involves $n+1$ function evaluations plus $O(n)$ additions and multiplications. If we assume that the function evaluations are much more expensive than the floating-point operations, the computational cost of the serial algorithm is
$$T(1,n) = (n+1)\,t_f,$$
where $t_f$ denotes the cost of one function evaluation. The computational cost of the parallel algorithm is
$$T(P,n) = \left\lceil \frac{n-1}{P} \right\rceil t_f + C\,\lceil \log_2 P \rceil + 2t_f, \tag{10.20}$$
where the first term corresponds to the cost of computing $s_p$ using (10.7), the second term corresponds to the cost of the reduction operation needed to sum all the $s_p$ values, and the last term corresponds to the cost of computing $f(a)$ and $f(b)$ at the end.
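Although (10.7)-(10.10) are not reproduced here, the cost model above can be made concrete with a minimal sketch of the parallel composite trapezoidal rule, with the $P$ processes simulated by a loop: each process $p$ sums $f$ at its share of the $n-1$ interior points to form a partial sum $s_p$, the partial sums are then reduced to one value, and $f(a)$ and $f(b)$ are added at the end. The partitioning scheme and the test integrand below are illustrative assumptions, not taken from the text.

```python
import math

def parallel_trapezoid(f, a, b, n, P):
    """Composite trapezoidal rule on n subintervals, with the n-1
    interior function evaluations split among P (simulated) processes."""
    h = (b - a) / n
    partials = []
    for p in range(P):
        # Process p handles interior points lo..hi-1 (indices 1..n-1 overall).
        lo = 1 + p * (n - 1) // P
        hi = 1 + (p + 1) * (n - 1) // P
        s_p = sum(f(a + i * h) for i in range(lo, hi))
        partials.append(s_p)
    # Reduction step: combine the P partial sums (the C*ceil(log2 P) term).
    total = sum(partials)
    # Endpoint contributions f(a) and f(b), computed last (the 2*t_f term).
    return h * (total + 0.5 * (f(a) + f(b)))

# Example: integrate sin on [0, pi]; the exact value is 2.
approx = parallel_trapezoid(math.sin, 0.0, math.pi, 1000, 4)
```

The result is independent of $P$; only the distribution of the $n-1$ interior evaluations changes, which is what makes the first term of (10.20) scale as $\lceil (n-1)/P \rceil$.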
Therefore, the speedup will be of the following form:
$$S(P,n) = \frac{T(1,n)}{T(P,n)} = \frac{(n+1)\,t_f}{\left\lceil \frac{n-1}{P} \right\rceil t_f + C\,\lceil \log_2 P \rceil + 2t_f} = \frac{n+1}{\left\lceil \frac{n-1}{P} \right\rceil + 2 + \bar{C}\,\lceil \log_2 P \rceil}. \tag{10.21}$$
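To see the qualitative behavior of (10.21), one can tabulate $S(P,n)$ for a fixed $n$ and a guessed value of $\bar{C}$; the value $\bar{C} = 10$ used below is purely illustrative, not a measured constant.

```python
import math

def speedup(P, n, C_bar):
    """S(P,n) from (10.21): (n+1) / (ceil((n-1)/P) + 2 + C_bar*ceil(log2 P))."""
    return (n + 1) / (math.ceil((n - 1) / P) + 2 + C_bar * math.ceil(math.log2(P)))

# With n fixed, the reduction term C_bar*ceil(log2 P) eventually dominates
# the shrinking ceil((n-1)/P) term, so the speedup saturates and then decays.
n, C_bar = 10_000, 10.0
values = [speedup(P, n, C_bar) for P in (1, 2, 4, 16, 64, 256, 1024, 4096)]
```

For these assumed numbers the curve rises almost linearly for small $P$, then turns over once $\bar{C}\lceil \log_2 P \rceil$ outweighs the per-process work.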
Of course, to compute the actual values of $S(P,n)$, we need to know the value of the constant $\bar{C} = C/t_f$. This can be achieved by measuring $T(P,n)$ for a number of different choices of $P$ and $n$ and estimating the value of $\bar{C}$ by the method of least squares. Even without knowing the exact value of $\bar{C}$ in (10.21), we can say that the parallel efficiency $\eta(P)$ will fall further below 100% as $P$ increases. This is because the $\bar{C}\,\lceil \log_2 P \rceil$ term in the denominator of (10.21) grows with $P$, meaning that the speedup will eventually saturate and then decrease after $P$ passes a threshold value, which depends on the actual values of $n$ and $\bar{C}$.
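The least-squares estimation of $\bar{C}$ mentioned above can be sketched as follows. Since the model $T(P,n)/t_f = \lceil (n-1)/P \rceil + 2 + \bar{C}\,\lceil \log_2 P \rceil$ is linear in the single unknown $\bar{C}$, the fit reduces to a one-parameter regression. The timing data below are synthetic, generated from the model itself, not measurements from the text.

```python
import math

def estimate_C_bar(measurements, t_f):
    """One-parameter least-squares fit of C_bar in the model
       T(P,n)/t_f = ceil((n-1)/P) + 2 + C_bar * ceil(log2 P).
    `measurements` is a list of (P, n, T) tuples, T and t_f in the same units."""
    num = den = 0.0
    for P, n, T in measurements:
        x = math.ceil(math.log2(P))                # regressor
        y = T / t_f - math.ceil((n - 1) / P) - 2   # residual attributed to reduction
        num += x * y
        den += x * x
    return num / den

# Synthetic check: generate timings from the model with C_bar = 8.0
# (illustrative numbers, not measurements from the book).
t_f = 1e-6
data = [(P, 10_000,
         t_f * (math.ceil(9_999 / P) + 2 + 8.0 * math.ceil(math.log2(P))))
        for P in (2, 4, 8, 16, 32)]
```

With real timings the recovered $\bar{C}$ would of course carry measurement noise; averaging over several runs per $(P,n)$ pair is advisable.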
Example 10.5. Table 10.1 shows some speedup results measured on a small cluster of PCs, where the interconnect is 1 Gbit/s Ethernet. The parallel program is written using the message-passing interface (MPI [14, 26]) and implements the compos-
 