Neo4j and HyperGraphDB are two centralized graph processing engines that do not partition data graphs across multiple machines. Both support online query processing. However, their scalability is limited, because they cannot handle very large graphs efficiently owing to costly disk accesses. Among distributed graph processing systems, many engines are disk-based, mainly for reliability and scalability. The in-memory graph explorations of Trinity and Surfer resolve this random I/O bottleneck. As for graph partitioning, most graph engines (except Surfer) use random hash partitioning by default; Surfer adopts a network-performance-aware graph partitioning specifically designed for cloud environments.
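As a concrete illustration of random hash partitioning, the following Python sketch assigns vertices to workers by hashing their IDs; the function, vertex, and worker names are hypothetical and are not taken from any of the engines above.

def hash_partition(vertex_ids, num_workers):
    """Assign each vertex to a worker machine by hashing its ID."""
    partitions = {w: [] for w in range(num_workers)}
    for v in vertex_ids:
        # Random hash partitioning ignores graph structure entirely.
        partitions[hash(v) % num_workers].append(v)
    return partitions

# Example: spread six vertices across three hypothetical workers.
print(hash_partition(["v0", "v1", "v2", "v3", "v4", "v5"], 3))

Such hashing balances vertices across machines but is oblivious to edge locality and network topology, which is the gap that Surfer's network-performance-aware partitioning targets.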
To illustrate the differences between these systems with an example, we briefly compare their reported performance on the PageRank computation, a typical algorithm for benchmarking graph processing.
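For reference, one PageRank iteration can be sketched in a few lines of Python; the toy graph, damping factor, and function names below are illustrative assumptions rather than code from any of the systems discussed.

def pagerank_iteration(graph, ranks, damping=0.85):
    """graph: dict mapping each vertex to its list of out-neighbors."""
    n = len(graph)
    new_ranks = {v: (1.0 - damping) / n for v in graph}
    for v, neighbors in graph.items():
        if not neighbors:
            continue  # dangling vertices contribute no outgoing rank here
        share = damping * ranks[v] / len(neighbors)
        for u in neighbors:
            new_ranks[u] += share
    return new_ranks

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {v: 1.0 / len(graph) for v in graph}
for _ in range(20):  # fixed number of iterations for the example
    ranks = pagerank_iteration(graph, ranks)
print(ranks)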
Neo4j and HyperGraphDB are centralized graph processing engines, and hence the size of the graphs they can process is limited by the capacity of the centralized server. Because graph processing often involves random data accesses, graphs that cannot fit into main memory may incur numerous disk accesses that significantly degrade performance.
Pegasus supports iterative matrix-vector multiplications and implements the matrix approach for computing PageRank, together with its optimizations. The Pegasus experiments confirmed that performance improves as the number of machines increases and scales linearly as the graph size grows beyond 1 billion nodes. Pegasus finished one PageRank iteration in around 100 seconds on YahooWeb (1.4 billion nodes) under the default setting of 9 supercomputers.
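In the matrix formulation used by Pegasus, each PageRank iteration is a matrix-vector multiplication (Pegasus generalizes this operation in its GIM-V primitive on Hadoop). The NumPy sketch below shows the formulation on a small dense matrix for the same three-vertex example as above; it illustrates the computation only, not Pegasus's distributed implementation.

import numpy as np

# Column-stochastic transition matrix for the three-vertex graph above:
# M[i, j] = 1/outdeg(j) if there is an edge j -> i, and 0 otherwise.
M = np.array([[0.0, 0.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 1.0, 0.0]])
n = M.shape[0]
d = 0.85                      # damping factor
r = np.full(n, 1.0 / n)       # initial rank vector

for _ in range(20):
    # One PageRank iteration = one matrix-vector multiplication
    # plus the uniform teleport term.
    r = d * (M @ r) + (1.0 - d) / n

print(r)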
Pregel implements a vertex-oriented PageRank algorithm under the message-passing paradigm. To handle large graph data, its original implementation uses a default random hash function to partition the graph data across worker processes. Giraph [4] is a publicly available implementation of Pregel, and its PageRank performance has been reported in comparison with Trinity [67]. Trinity also implements a vertex-oriented PageRank algorithm and ran it on eight commodity machines. Trinity keeps graphs in main memory, which leads to superior efficiency; in particular, it completes one PageRank iteration on a 1-billion-node graph in less than one minute. In comparison, Giraph is at least two orders of magnitude slower and runs out of memory when processing some large graphs.
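In the vertex-oriented formulation used by Pregel and Trinity, each vertex sums the rank messages received in the previous superstep, updates its own value, and sends its new rank divided by its out-degree to its neighbors. The Python sketch below mimics this compute-and-message pattern with a synchronous superstep loop; it illustrates the programming model and is not Pregel's or Trinity's actual API.

def superstep(graph, inbox, damping=0.85):
    """One synchronous superstep: consume messages, update ranks, send new ones."""
    n = len(graph)
    new_ranks = {}
    outbox = {v: [] for v in graph}
    for v, neighbors in graph.items():
        # Update this vertex's rank from the messages it received.
        new_ranks[v] = (1.0 - damping) / n + damping * sum(inbox[v])
        # Send the new rank, split over out-edges, to each neighbor.
        if neighbors:
            share = new_ranks[v] / len(neighbors)
            for u in neighbors:
                outbox[u].append(share)
    return new_ranks, outbox

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
# Messages produced by an initial superstep in which every vertex
# holds rank 1/n and sends it split over its out-edges.
inbox = {v: [1.0 / len(graph) / len(graph[u]) for u in graph if v in graph[u]]
         for v in graph}
for _ in range(20):
    ranks, inbox = superstep(graph, inbox)
print(ranks)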
Surfer tested the Network Rank algorithm* on a social network that consists of more than half a billion nodes and about 30 billion edges. The main focus of Surfer is to show the performance improvement due to its network-performance-aware graph partitioning. Under different network topology settings, the observed speedup in response time over the engine built on top of a MapReduce platform ranges from 1.7 to 5.8 times.
7.3.3 Other Graph Processing Platforms/Systems
We have recently seen cloud computing platforms equipped with emerging hardware such as multicore CPUs and GPUs (Graphics Processing Units).
* Network ranking is the generation of a ranking of the vertices in the graph using PageRank or its variants.