Data parallelism is achieved when each machine runs one or many tasks over
different partitions of data. As a specific example, assume array A is shared among
three machines in a distributed shared memory system. Consider also a distributed
program that simply adds all elements of array A . Machines 1, 2, and 3 can each be assigned to run the addition task on one-third of A , or 50 elements, as shown in Figure 1.9. The data can be allocated across the tasks using the shared-memory programming model, which requires a synchronization mechanism. Clearly, such a program is SPMD. In contrast, array A can also be partitioned evenly and distributed across the three machines using the message-passing model, as shown in Figure 1.10. Each machine runs the addition task independently; nonetheless, the summation results eventually have to be aggregated at one main task to generate a grand total. In such a scenario, every task is similar in the sense that it performs the same addition operation, yet on a different part of A . The main task, however, additionally aggregates the partial sums, making it slightly different from the other two tasks. Obviously, this makes the program MPMD.
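To make the message-passing variant concrete, the following is a minimal sketch of the kind of program depicted in Figure 1.10, written in Python with the mpi4py library (an assumption; the text does not prescribe a particular library). It assumes exactly three processes, launched for example via mpiexec -n 3 python sum_mpi.py, and all names are illustrative.

   # Each process owns a private 50-element partition of A; there is
   # no shared memory, so partial sums are exchanged as messages.
   from mpi4py import MPI

   comm = MPI.COMM_WORLD
   rank = comm.Get_rank()                               # 0, 1, or 2

   A_local = list(range(rank * 50, (rank + 1) * 50))    # illustrative data
   local_sum = sum(A_local)

   if rank == 0:
       # The main task additionally aggregates the partial sums,
       # which is what makes the program MPMD rather than SPMD.
       grand_sum = local_sum
       for source in (1, 2):
           grand_sum += comm.recv(source=source)
       print("grand_sum =", grand_sum)
   else:
       # The other two tasks only compute and send their partial sums.
       comm.send(local_sum, dest=0)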
As a real example, MapReduce uses data parallelism. In particular, input data sets are partitioned by HDFS into blocks (by default, 64 MB per block), allowing MapReduce to effectively exploit data parallelism by running a map task per one or many blocks (by default, each map task processes only one HDFS block). Furthermore, whereas map tasks operate on HDFS blocks, reduce tasks operate on the output of map tasks, denoted as intermediate output or partitions . In principle, each reduce task can process one or many partitions. As a consequence, map and reduce tasks process different data. Moreover, map and reduce tasks are inherently dissimilar (i.e., the map and the reduce functions incorporate different binary codes). Therefore, MapReduce jobs fall under the MPMD category.
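To illustrate why such a job is MPMD, the following is a rough sketch in the style of Hadoop Streaming, which runs arbitrary executables as map and reduce tasks; the file name and key are illustrative and this is not the text's own example. The map and reduce tasks run different code and consume different kinds of input: map tasks read raw lines of an HDFS block, whereas reduce tasks read the grouped intermediate output.

   # job.py -- assumes the input consists of whitespace-separated integers.
   import sys

   def mapper():
       # Runs once per HDFS block assigned to this map task.
       # Emits key<TAB>value pairs as intermediate output.
       for line in sys.stdin:
           for token in line.split():
               print("sum\t" + token)

   def reducer():
       # Runs on one or many partitions of the intermediate output,
       # i.e., on different data than the map tasks.
       total = 0
       for line in sys.stdin:
           _key, value = line.rstrip("\n").split("\t")
           total += int(value)
       print("sum\t" + str(total))

   if __name__ == "__main__":
       # e.g., -mapper "python job.py map" -reducer "python job.py reduce"
       mapper() if sys.argv[1] == "map" else reducer()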
Graph parallelism is another form of parallelism that, in contrast to data parallelism, focuses on distributing graphs as opposed to data. Indeed, most distributed programs fall somewhere on a continuum between data parallelism and graph parallelism.
Array A (shared among the three nodes)

Node 1:
   do i = 0, 49
      local_sum = local_sum + A[i];
   end do
   lock(mylock);
   grand_sum = grand_sum + local_sum;
   unlock(mylock);

Node 2:
   do i = 50, 99
      local_sum = local_sum + A[i];
   end do
   lock(mylock);
   grand_sum = grand_sum + local_sum;
   unlock(mylock);

Node 3:
   do i = 100, 149
      local_sum = local_sum + A[i];
   end do
   lock(mylock);
   grand_sum = grand_sum + local_sum;
   unlock(mylock);
FIGURE 1.9
An SPMD distributed program using the shared-memory programming model.
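For readers who want to run the pattern of Figure 1.9, here is one possible rendering in Python, with threads standing in for the three nodes of the distributed shared memory system (an assumption made purely for illustration; the figure itself is pseudocode).

   import threading

   A = list(range(150))        # the shared array A
   grand_sum = 0
   mylock = threading.Lock()   # synchronization mechanism guarding grand_sum

   def add_task(lo, hi):
       # Same code on every node (SPMD), applied to a different slice of A.
       global grand_sum
       local_sum = 0
       for i in range(lo, hi):
           local_sum += A[i]
       with mylock:                          # lock(mylock) ... unlock(mylock)
           grand_sum += local_sum

   threads = [threading.Thread(target=add_task, args=(n * 50, (n + 1) * 50))
              for n in range(3)]             # nodes 1, 2, and 3
   for t in threads:
       t.start()
   for t in threads:
       t.join()
   print(grand_sum)                          # 11175 for A = 0, 1, ..., 149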