Neo4j and HyperGraphDB are two centralized graph processing engines that do not partition data graphs across multiple machines. Both support online query processing. However, their scalability is limited, because they cannot handle very large graphs efficiently owing to costly disk accesses. Among distributed graph processing systems, many engines are disk-based, mainly for reliability and scalability. The in-memory graph explorations of Trinity and Surfer resolve this random I/O bottleneck. As for graph partitioning, most graph engines (except Surfer) use random hash partitioning by default; Surfer adopts a network-performance-aware graph partitioning specifically designed for cloud environments.
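As a concrete illustration of random hash partitioning, the following Python sketch assigns vertices to workers by hashing their IDs; the function, vertex, and worker names are hypothetical and are not taken from any of the engines above.

def hash_partition(vertex_ids, num_workers):
    """Assign each vertex to a worker machine by hashing its ID."""
    partitions = {w: [] for w in range(num_workers)}
    for v in vertex_ids:
        # Random hash partitioning ignores graph structure entirely.
        partitions[hash(v) % num_workers].append(v)
    return partitions

# Example: spread six vertices across three hypothetical workers.
print(hash_partition(["v0", "v1", "v2", "v3", "v4", "v5"], 3))

Such hashing balances vertices across machines but is oblivious to edge locality and network topology, which is the gap that Surfer's network-performance-aware partitioning targets.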
To illustrate the differences between these systems with an example, we briefly compare their reported performance on the PageRank computation, a typical algorithm for benchmarking graph processing.
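For reference, one PageRank iteration can be sketched in a few lines of Python; the toy graph, damping factor, and function names below are illustrative assumptions rather than code from any of the systems discussed.

def pagerank_iteration(graph, ranks, damping=0.85):
    """graph: dict mapping each vertex to its list of out-neighbors."""
    n = len(graph)
    new_ranks = {v: (1.0 - damping) / n for v in graph}
    for v, neighbors in graph.items():
        if not neighbors:
            continue  # dangling vertices contribute no outgoing rank here
        share = damping * ranks[v] / len(neighbors)
        for u in neighbors:
            new_ranks[u] += share
    return new_ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {v: 1.0 / len(graph) for v in graph}
for _ in range(20):  # fixed number of iterations for the example
    ranks = pagerank_iteration(graph, ranks)
print(ranks)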
Neo4j and HyperGraphDB are centralized graph processing engines, and hence the size of the graphs they can process is limited by the capacity of the centralized server. Because graph processing often involves random data accesses, graphs that cannot fit into main memory may incur numerous disk accesses that significantly degrade performance.
Pegasus supports iterative matrix-vector multiplications and implements the matrix approach for computing PageRank, together with its optimizations. The Pegasus experiments confirmed that performance improves as the number of machines increases and scales linearly as the graph size grows beyond 1 billion nodes. Pegasus finished one PageRank iteration in around 100 seconds on YahooWeb (1.4 billion nodes) under the default setting of 9 supercomputers.
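In the matrix formulation used by Pegasus, each PageRank iteration is a matrix-vector multiplication (Pegasus generalizes this operation in its GIM-V primitive on Hadoop). The NumPy sketch below shows the formulation on a small dense matrix for the same three-vertex example as above; it illustrates the computation only, not Pegasus's distributed implementation.

import numpy as np

# Column-stochastic transition matrix for the three-vertex graph above:
# M[i, j] = 1/outdeg(j) if there is an edge j -> i, and 0 otherwise.
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
n = M.shape[0]
d = 0.85                      # damping factor
r = np.full(n, 1.0 / n)       # initial rank vector

for _ in range(20):
    # One PageRank iteration = one matrix-vector multiplication
    # plus the uniform teleport term.
    r = d * (M @ r) + (1.0 - d) / n

print(r)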
Pregel implements a vertex-oriented PageRank algorithm under the message-passing paradigm. To handle large graph data, its original implementation uses a default random hash function to partition the graph data across worker processes. Giraph [4] is a publicly available implementation of Pregel, and its PageRank performance has been reported in comparison with Trinity [67]. Trinity also implements a vertex-oriented PageRank algorithm and ran it on eight commodity machines. Trinity keeps graphs in main memory, which leads to superior efficiency; in particular, it completes one PageRank iteration on a 1-billion-node graph in less than one minute. In comparison, Giraph is at least two orders of magnitude slower and runs out of memory when processing some large graphs.
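In the vertex-oriented formulation used by Pregel and Trinity, each vertex sums the rank messages received in the previous superstep, updates its own value, and sends its new rank divided by its out-degree to its neighbors. The Python sketch below mimics this compute-and-message pattern with a synchronous superstep loop; it illustrates the programming model and is not Pregel's or Trinity's actual API.

def superstep(graph, inbox, damping=0.85):
    """One synchronous superstep: consume messages, update ranks, send new ones."""
    n = len(graph)
    new_ranks = {}
    outbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        # Update this vertex's rank from the messages it received.
        new_ranks[v] = (1.0 - damping) / n + damping * sum(inbox[v])
        # Send the new rank, split over out-edges, to each neighbor.
        if neighbors:
            share = new_ranks[v] / len(neighbors)
            for u in neighbors:
                outbox[u].append(share)
    return new_ranks, outbox

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
# Messages produced by an initial superstep in which every vertex
# holds rank 1/n and sends it split over its out-edges.
inbox = {v: [1.0 / len(graph) / len(graph[u]) for u in graph if v in graph[u]]
         for v in graph}
for _ in range(20):
    ranks, inbox = superstep(graph, inbox)
print(ranks)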
Surfer tested the Network Rank algorithm* on a social network that consists of more than half a billion nodes and about 30 billion edges. The main focus of Surfer is to show the performance improvement due to its network-performance-aware graph partitioning. Under different network topology settings, the observed speedup in response time over the engine built on top of a MapReduce platform ranges from 1.7 to 5.8 times.
7.3.3 Other Graph Processing Platforms/Systems
We have recently seen cloud computing platforms equipped with emerging hardware such as multicore CPUs and GPUs (Graphics Processing Units).
* Network ranking is the generation of a ranking of the vertices in the graph using PageRank or its variants.