PEGASUS - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

8.2.3 C omPuting r aDius anD D iameter

Typical algorithms to compute the radius and the diameter of a graph include Breadth

First Search (BFS) and Floyd's algorithm [19] when no negative cycles are present.

Both approaches are prohibitively slow for large-scale graphs, requiring O ( n 2 + nm )

and O ( n 3 ) time, respectively. For the same reason, related BFS or all-pair shortest-

path based algorithms like [8,22,35,46] cannot handle large-scale graphs.

A sampling approach starts BFS from a subset of nodes, typically chosen at ran-

dom as in [16]. Despite its practicality, this approach has no obvious solution for

choosing the representative sample for BFS. An interesting approach has been pro-

posed by Cohen [18], but according to practitioner's experience [13] it appears not

to be as scalable as the ANF algorithm [42]. The latter is closely related to our work

since it is a sequential algorithm based on Flajolet-Martin sketches [23]. We review

its key idea in the next section.

8.2.4

D istinCt e lements in m ultisets

Let A = { a 1 ,…, a m } be a multiset where a i ∈ [n] for all i = 1,…, m. Let m i = |{ j : a j = i }|.

For each k ≥ 0 define F

∑ 1 . The numbers F k are called frequency moments

of the multiset, and provide useful statistics. We notice that F 0 is the number of

distinct elements in A , F 1 = m . Historically, Morris was the first to show that F 1

can be approximated with O (log log m ) = O (log log n ) [40]. Flajolet and Martin

designed an algorithm that needs O (log n ) bits of memory to approximate F 0 [23].

Since PEGASUS implements Flajolet-Martin sketches, we briefly describe the main

idea of their algorithm. Let h : M → {0, 1,…, 2 L − 1} be a hash function that hashes

the elements of the multiset uniformly over the allowed range of values. Here, L has

to satisfy 2 L ≥ n . The algorithm maintains a bitmask, which is initially set to zero

and its i ith bit is set to one if there exists an element a j ∈ A whose hash value h ( a j )

ends in i consecutive zeros. Notice that the pattern 0 i of i trailing zeros occurs with

probability 2 − i under the assumption of uniform hashing. Therefore, when i is signifi-

cantly larger than log 2 ( n ) we expect to observe zeros but when i ≈ log 2 ( n ) we expect

to observe both. Flajolet and Martin showed that the expected value of the leftmost

zero position R in the bitmask is E ( R ) = log 2 (ϕ n ) where ϕ = 0:77351 and that the

variance is 1.12. To reduce the variance, typically, we use several hash functions and

use the average. For further improvements on this problem, the interested reader may

read [6,9-11,18,21,24,26,48]. Recently, Kane, Nelson, and Woodruff provided an

optimal algorithm for estimating F 0 [27].

Before we delve into the details of the proposed algorithm, we outline the high-

level goal of PEGASUS. A large number of important computational tasks can

be solved by iterating matrix-vector multiplication. For instance, computing the

diameter, radius distributions, PageRank scores, spectral clustering and connected

components fall into this class of computational problems. PEGASUS implements

optimized programming primitives to handle this class of problems.

Consider the following assignment v ′ ← M × v where M

; v ∈ n . The i th

∈



∑

coordinate of v ′ is

′ =

 . Typically, in our applications, M is

i jj

Large Scale and Big Data: Processing and Management

Search WWH ::

Custom Search

Home