PEGASUS - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

8.6.4 Static Patterns ................................................................................... 275

8.6.4.1 Diameter............................................................................. 275

8.6.4.2 Shape of Distribution ......................................................... 278

8.6.4.3 Radius Plot of GCC ........................................................... 280

8.6.4.4 “Core” and “Whisker” Vertices ......................................... 280

8.6.5 Temporal Patterns ............................................................................. 282

8.7 Conclusion .................................................................................................... 283

References .............................................................................................................. 284

8.1 INTRODUCTION

The scale of graph data that is nowadays collected and required to be processed is mas-

sive. For example, in the context of online services, the web graph amounts to at least

1 trillion of links [1]. Facebook recently reported more than 1 billion of users and 140

billion of friend connections [2], and Twitter reported in 2009 more than 40 million

of users and about 1.5 billion of social relations [30]. The unprecedented proliferation

of data provides us with new opportunities and benefits but also poses hard compu-

tational challenges. Frequent graph computations such as community detection [25],

finding connected components [29], iterative computations using graph input data such

as computing PageRank and its variations [41], and shortest path and radius computa-

tions [17,28] become challenging computational tasks in the realm of big graph data.

In this chapter, we describe PEGASUS, an open-source Peta Graph Mining library,

which performs typical graph mining tasks such as computing the diameter of a graph,

computing the radius of each node, finding the connected components, and computing

the importance score of nodes. The main idea behind PEGASUS is to capitalize on

matrix-vector multiplication as a main primitive for the software engineer. Inspired by

the work of [47], which showed that triangles can be estimated by few matrix-vector

multiplications, PEGASUS introduces a set of different operators that solve a variety

of graph mining tasks together with an optimized implementation of matrix-vector

multiplications in MapReduce. PEGASUS is a solid engineering effort that allows us

to manipulate large-scale graphs. Since the introduction of PEGASUS, other large-

scale graph processing systems have been introduced, among them Google's Pregel

[36], LinkedIn's Giraph [3], and GraphLab [34]. It is worth mentioning that Giraph

uses several algorithms and ideas from PEGASUS, including the connected compo-

nents algorithm. Also, PEGASUS has been included in Hadoop for Windows Azure

[4]. PEGASUS provides us with the ability to investigate the structure of large-scale

graphs. This allows us to detect outliers and obtain significant structural patterns.

This chapter is organized as follows: Section 8.2 presents related work. Section 8.3

presents the proposed method. Section 8.4 presents Hadoop implementations and Section

8.5 timings. Section 8.6 shows findings of PEGASUS in several real-world networks.

8.2 RELATED WORK

In Section 8.2.1, we discuss two established patterns, the bow-tie structure of the web

and the six degrees of separation theory. In Section 8.2.2, we provide basic graph theo-

retic definitions for our work. In Section 8.2.3, we discuss work related to computing the

Search WWH ::

Custom Search

Home