Data parallelism is achieved when each machine runs one or many tasks over
different partitions of data. As a specific example, assume array A is shared among
three machines in a distributed shared memory system. Consider also a distributed
program that simply adds all elements of array A . Machines 1, 2, and 3 can each be assigned to run the addition task on one-third of A , or 50 elements, as shown in Figure 1.9. The data can be allocated across the tasks using the shared-memory programming model, which requires a synchronization mechanism. Clearly, such a program is SPMD. In contrast, array A can also be partitioned evenly and distributed across the three machines using the message-passing model, as shown in Figure 1.10. Each machine runs the addition task independently; nonetheless, the summation results eventually have to be aggregated at one main task to generate a grand total. In such a scenario, every task is similar in the sense that it performs the same addition operation, yet on a different part of A . The main task, however, additionally aggregates the partial sums, making it slightly different from the other two tasks. Obviously, this makes the program MPMD.
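To make the message-passing variant concrete, the following is a minimal sketch of the kind of program depicted in Figure 1.10, written in Python with the mpi4py library (an assumption; the text does not prescribe a particular library). It assumes exactly three processes, launched for example via mpiexec -n 3 python sum_mpi.py, and all names are illustrative.

   # Each process owns a private 50-element partition of A; there is
   # no shared memory, so partial sums are exchanged as messages.
   from mpi4py import MPI

   comm = MPI.COMM_WORLD
   rank = comm.Get_rank()                               # 0, 1, or 2

   A_local = list(range(rank * 50, (rank + 1) * 50))    # illustrative data
   local_sum = sum(A_local)

   if rank == 0:
       # The main task additionally aggregates the partial sums,
       # which is what makes the program MPMD rather than SPMD.
       grand_sum = local_sum
       for source in (1, 2):
           grand_sum += comm.recv(source=source)
       print("grand_sum =", grand_sum)
   else:
       # The other two tasks only compute and send their partial sums.
       comm.send(local_sum, dest=0)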
As a real example, MapReduce uses data parallelism. In particular, input data sets are partitioned by HDFS into blocks (by default, 64 MB per block), allowing MapReduce to effectively exploit data parallelism by running a map task per one or many blocks (by default, each map task processes only one HDFS block). Furthermore, whereas map tasks operate on HDFS blocks, reduce tasks operate on the output of map tasks, denoted as intermediate output or partitions . In principle, each reduce task can process one or many partitions. As a consequence, map and reduce tasks process different data. Moreover, map and reduce tasks are inherently dissimilar (i.e., the map and the reduce functions incorporate different binary codes). Therefore, MapReduce jobs fall under the MPMD category.
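To illustrate why such a job is MPMD, the following is a rough sketch in the style of Hadoop Streaming, which runs arbitrary executables as map and reduce tasks; the file name and key are illustrative and this is not the text's own example. The map and reduce tasks run different code and consume different kinds of input: map tasks read raw lines of an HDFS block, whereas reduce tasks read the grouped intermediate output.

   # job.py -- assumes the input consists of whitespace-separated integers.
   import sys

   def mapper():
       # Runs once per HDFS block assigned to this map task.
       # Emits key<TAB>value pairs as intermediate output.
       for line in sys.stdin:
           for token in line.split():
               print("sum\t" + token)

   def reducer():
       # Runs on one or many partitions of the intermediate output,
       # i.e., on different data than the map tasks.
       total = 0
       for line in sys.stdin:
           _key, value = line.rstrip("\n").split("\t")
           total += int(value)
       print("sum\t" + str(total))

   if __name__ == "__main__":
       # e.g., -mapper "python job.py map" -reducer "python job.py reduce"
       mapper() if sys.argv[1] == "map" else reducer()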
Graph parallelism is another form of parallelism that, in contrast to data parallelism, focuses on distributing graphs as opposed to data. Indeed, most distributed programs fall somewhere on a continuum between data parallelism and graph parallelism.
Array A (shared among the three nodes)

Node 1:
   do i = 0, 49
      local_sum = local_sum + A[i];
   end do
   lock(mylock);
   grand_sum = grand_sum + local_sum;
   unlock(mylock);

Node 2:
   do i = 50, 99
      local_sum = local_sum + A[i];
   end do
   lock(mylock);
   grand_sum = grand_sum + local_sum;
   unlock(mylock);

Node 3:
   do i = 100, 149
      local_sum = local_sum + A[i];
   end do
   lock(mylock);
   grand_sum = grand_sum + local_sum;
   unlock(mylock);
FIGURE 1.9
An SPMD distributed program using the shared-memory programming model.
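For readers who want to run the pattern of Figure 1.9, here is one possible rendering in Python, with threads standing in for the three nodes of the distributed shared memory system (an assumption made purely for illustration; the figure itself is pseudocode).

   import threading

   A = list(range(150))        # the shared array A
   grand_sum = 0
   mylock = threading.Lock()   # synchronization mechanism guarding grand_sum

   def add_task(lo, hi):
       # Same code on every node (SPMD), applied to a different slice of A.
       global grand_sum
       local_sum = 0
       for i in range(lo, hi):
           local_sum += A[i]
       with mylock:                          # lock(mylock) ... unlock(mylock)
           grand_sum += local_sum

   threads = [threading.Thread(target=add_task, args=(n * 50, (n + 1) * 50))
              for n in range(3)]             # nodes 1, 2, and 3
   for t in threads:
       t.start()
   for t in threads:
       t.join()
   print(grand_sum)                          # 11175 for A = 0, 1, ..., 149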