Distributed Programming for the Cloud - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

Once a problem is modeled as a graph, it can be distributed over machines in

a distributed system using a graph partitioning technique . Graph partitioning

implies dividing the work (i.e., the vertices) over distributed nodes for efficient

distributed computation. As is the case with data parallelism, the basic idea is

simple; by distributing a large graph across multiple machines, it becomes pos-

sible to process different parts of the graph in parallel. As such, graph partitioning

enables what we refer to as graph parallelism . The standard objective of graph

partitioning is to uniformly distribute the work over p processors by partitioning

the vertices into p equally weighted partitions, while minimizing inter-node com-

munication reflected by edges. Such an objective is typically referred to as the

standard edge cut metric [34]. The graph partitioning problem is NP-hard [21],

yet heuristics can be implemented to achieve near optimal solutions [34,35,39].

As a specific example, Figure 1.11 demonstrates three partitions, P 1 , P 2 , and P 3

at which vertices v 1 ,..., v 8 are divided using the edge cut metric. Each edge has a

weight of 2 corresponding to 1 unit of data being communicated in each direction.

Consequently, the total weight of the shown edge cut is 10. Other cuts will result

in more communication traffic. Clearly, for communication-intensive applications,

graph partitioning is very critical and can play a dramatic role in dictating the

overall application performance. We discuss some of the challenges pertaining to

graph partitioning in Section 1.5.3.

As real examples, both Pregel and GraphLab employ graph partitioning.

Specifically, in Pregel each vertex in a graph is assigned a unique ID, and partition-

ing of the graph is accomplished via using a hash(ID) mod N function, where N is

the number of partitions. The hash function is customizable and can be altered by

users. After partitioning the graph, partitions are mapped to cluster machines using

a mapping function of a user choice. For example, a user can define a mapping func-

tion for a Web graph that attempts to exploit locality by co-locating vertices of the

same Web site (a vertex in this case represents a Web page). In contrast to Pregel,

GraphLab utilizes a two-phase partitioning strategy. In the first phase, the input

P 1

v 1

v 3

A minimum cut with a

total weight of 10

v 2

2

P 3

P 2

2

v 4

v 7

v 8

v 5

v 6

FIGURE 1.11

A graph partitioned using the edge cut metric.

Large Scale and Big Data: Processing and Management

Search WWH ::

Custom Search

Home