on a distinct segment. This assumption fits very well with a shared-memory archi-
tecture, i.e., arrays x and y are shared among all processors. Such shared-memory
programming closely resembles standard serial programming, but there are differ-
ences that require special attention. For example, the integer loop index i cannot be
shared between processors, because each processor needs to use its own i index to
traverse an assigned segment of x and y .
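To make this concrete, here is a minimal shared-memory sketch of the function-evaluation loop, written with OpenMP purely as an illustration (the text does not prescribe a particular shared-memory programming model); declaring the loop index inside the loop makes it private to each thread:

#include <omp.h>

/* Hedged sketch: shared-memory evaluation of y[i] = f(x[i]).
   The arrays x and y are shared by all threads; each thread works
   on its own segment of the index range with a private index i. */
void evaluate_shared(int n, const double *x, double *y,
                     double (*f)(double))
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        y[i] = f(x[i]);
}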
In the case of distributed memory, the rule of thumb is to avoid allocating
global data structures if possible. Therefore, each processor typically only allo-
cates two local arrays, x_p and y_p, which are of length n_p and correspond to the
assigned piece of the global x and y arrays. Distributed-memory programming thus
has to consider more details, such as local data allocation and mapping between
local and global indices. However, the pure computing part on a distributed-memory
architecture is rather simple, as follows:
for (i=0; i<n_p; i++)
y_p[i] = f(x_p[i]);
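The surrounding bookkeeping is where the extra details enter. As a minimal sketch (assuming a block partitioning and a processor identifier my_rank between 0 and P-1; these names and the partitioning rule are illustrative, not fixed by the text), the local allocation and the local-to-global index mapping might look like:

#include <stdlib.h>

/* Hedged sketch: block partitioning of n global entries over P processors.
   Processor my_rank allocates only its local arrays x_p and y_p. */
void evaluate_local(int n, int P, int my_rank, double (*f)(double))
{
    /* The first n % P processors receive one extra entry. */
    int n_p    = n / P + (my_rank < n % P ? 1 : 0);
    /* Global index corresponding to local index 0. */
    int offset = my_rank * (n / P) + (my_rank < n % P ? my_rank : n % P);

    double *x_p = malloc(n_p * sizeof *x_p);
    double *y_p = malloc(n_p * sizeof *y_p);

    for (int i = 0; i < n_p; i++) {
        int i_global = offset + i;   /* mapping: local index -> global index */
        x_p[i] = (double) i_global;  /* placeholder: in practice x_p is filled
                                        from the problem data                 */
        y_p[i] = f(x_p[i]);
    }
    free(x_p);
    free(y_p);
}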
10.2.4 Example 2 of Data Parallelism
Parallelizing the previous example of function evaluation is very simple, because the
processors can work completely independently of each other. However, such embar-
rassingly parallel examples with no collaboration between the processors are rare in
scientific computing. Let us now look at another example where inter-processor
collaboration is needed.
The composite trapezoidal rule of numerical integration was derived in Chap. 1,
where the original formula was given as (1.16). Its purpose is to approximate the
integral $\int_a^b f(x)\,dx$ using $n+1$ equally spaced samples. Before we discuss its
parallelization, let us first recall the computationally more efficient formula of this
numerical integration rule, which was given earlier as (6.2):

$$
\int_a^b f(x)\,dx \approx h\left(\frac{1}{2}\bigl(f(a) + f(b)\bigr) + \sum_{i=1}^{n-1} f(a+ih)\right),
\qquad h = \frac{b-a}{n}.
\tag{10.6}
$$
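Read as code, (10.6) is just a loop over the inner points plus the two half-weighted end-points; a minimal serial sketch (not taken from the text) is:

/* Hedged sketch: serial composite trapezoidal rule, following (10.6). */
double trapezoidal(double a, double b, int n, double (*f)(double))
{
    double h   = (b - a) / n;
    double sum = 0.5 * (f(a) + f(b));   /* the two end-points, half weight */
    for (int i = 1; i <= n - 1; i++)
        sum += f(a + i * h);            /* the n-1 inner points            */
    return h * sum;
}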
Looking at (10.6), we see that it is quite similar to the previous example of
function evaluation. The difference is that the evaluated function values need to
be summed up, where the two end-points receive a half weight, in comparison with
the n-1 inner points. An important observation is that summation of a large set of
values can be achieved by letting each processor sum up a distinct subset and then
adding up all the partial sums. Similar to the previous example, parallelization starts
with dividing the n-1 inner-point function evaluations among the P processors, which also
carry out a subsequent partial summation. More specifically, the sampling points
x_1, x_2, ..., x_{n-1}, where x_i = a + ih, are segmented into P equal pieces. Then
each processor can compute a partial sum over its assigned segment of sampling points.
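A minimal sketch of such a processor-local partial sum (assuming each processor knows the first and last global indices, istart and istop, of its assigned inner points; these names are illustrative, not from the text) is:

/* Hedged sketch: partial sum over this processor's assigned inner points,
   with global indices istart..istop inside 1..n-1 and x_i = a + i*h. */
double partial_sum(double a, double h, int istart, int istop,
                   double (*f)(double))
{
    double s = 0.0;
    for (int i = istart; i <= istop; i++)
        s += f(a + i * h);
    return s;
}

The P partial sums must afterwards be added together across the processors, and the half-weighted end-point contribution (f(a) + f(b))/2 is included once before the result is multiplied by h, as in (10.6).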