communication overhead. However, enabling communication-computation overlap
can be a challenging programming task. Second, it sometimes happens that when
many processors of a distributed-memory system are used, the subproblem per
processor suddenly achieves much better utilization of the local cache than
solving the entire problem serially on one CPU with its local cache. This
provides the potential for superlinear speedup, i.e., S(P) > P, for particular
problem sizes.
Example 10.4. Let us analyze the speedup of the parallel composite trapezoidal
rule (10.7)-(10.10). As we can see from (10.6), the serial algorithm involves
n + 1 function evaluations plus O(n) additions and multiplications. If we assume
that the function evaluations are much more expensive than the floating-point
operations, the computational cost of the serial algorithm is

    T(1, n) = (n + 1) t_f,

where t_f denotes the cost of one function evaluation. The computational cost
of the parallel algorithm is

    T(P, n) = ⌈(n − 1)/P⌉ t_f + C ⌈log₂ P⌉ + 2 t_f,    (10.20)
where the first term corresponds to the cost of computing s_p using (10.7), the
second term corresponds to the cost of a reduction operation needed to sum all
the s_p values, and the last term corresponds to the cost of computing f(a) and
f(b) in the end.
Therefore, the speedup will be of the following form:

    S(P, n) = T(1, n) / T(P, n)
            = (n + 1) t_f / (⌈(n − 1)/P⌉ t_f + C ⌈log₂ P⌉ + 2 t_f)
            = (n + 1) / (⌈(n − 1)/P⌉ + 2 + C̄ ⌈log₂ P⌉).    (10.21)
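As an illustration (not taken from the text), formula (10.21) can be evaluated
numerically. The values n = 10,000 and C̄ = 10 below are arbitrary assumed
constants, chosen only to show how the predicted speedup behaves as P grows.

```python
import math

def speedup(P, n, C_bar):
    """Predicted speedup S(P, n) from the model (10.21);
    C_bar = C/t_f is the reduction cost in units of one
    function evaluation."""
    denom = math.ceil((n - 1) / P) + 2 + C_bar * math.ceil(math.log2(P))
    return (n + 1) / denom

# Illustrative values only: n = 10_000, C_bar = 10.
for P in (1, 2, 4, 16, 64, 256):
    print(f"P = {P:3d}  S = {speedup(P, 10_000, 10.0):7.1f}")
```

Running the loop shows the speedup staying below P and the gap widening with
larger P, exactly the saturation effect discussed below.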
Of course, to compute the actual values of S(P, n), we need to know the value
of the constant C̄ = C/t_f. This can be achieved by measuring T(P, n) for a
number of different choices of P and n and estimating the value of C̄ by the
method of least squares. Even without knowing the exact value of C̄ in (10.21),
we can say that the parallel efficiency η(P) will move further away from 100%
as P increases. This is because the C̄ ⌈log₂ P⌉ term in the denominator of
(10.21) increases with P, meaning that the speedup will eventually saturate and
then decrease after P passes a threshold value that depends on the actual
values of n and C̄.
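The least-squares fit mentioned above can be sketched as follows. Since the
model T(P, n)/t_f = ⌈(n − 1)/P⌉ + 2 + C̄ ⌈log₂ P⌉ is linear in the single
unknown C̄, the fit has a closed-form solution. The timings below are synthetic
numbers generated from the model itself, used only to illustrate the procedure;
real data would come from wall-clock measurements.

```python
import math

def estimate_C_bar(measurements, n, t_f):
    """Least-squares estimate of C_bar in the cost model
    T(P, n)/t_f = ceil((n - 1)/P) + 2 + C_bar * ceil(log2(P)).
    measurements is a list of (P, T) pairs with P > 1."""
    num = den = 0.0
    for P, T in measurements:
        x = math.ceil(math.log2(P))                # regressor: ceil(log2 P)
        y = T / t_f - math.ceil((n - 1) / P) - 2   # measured minus known terms
        num += x * y
        den += x * x
    return num / den                               # argmin_C of sum (y - C*x)^2

# Synthetic timings generated from the model with C_bar = 10;
# in practice these would be measured values of T(P, n).
t_f, n, true_C = 1e-6, 10_000, 10.0
data = [(P, t_f * (math.ceil((n - 1) / P) + 2 + true_C * math.ceil(math.log2(P))))
        for P in (2, 4, 8, 16, 32)]
C_bar = estimate_C_bar(data, n, t_f)   # recovers approximately 10.0
```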
Example 10.5. Table 10.1 shows some speedup results measured on a small cluster
of PCs, where the interconnect is 1 Gbit Ethernet. The parallel program is
written using the message-passing interface (MPI [14, 26]) and implements the compos-