Database Reference
In-Depth Information
8.2.3 C omPuting r aDius anD D iameter
Typical algorithms to compute the radius and the diameter of a graph include Breadth
First Search (BFS) and Floyd's algorithm [19] when no negative cycles are present.
Both approaches are prohibitively slow for large-scale graphs, requiring O ( n 2 + nm )
and O ( n 3 ) time, respectively. For the same reason, related BFS or all-pair shortest-
path based algorithms like [8,22,35,46] cannot handle large-scale graphs.
A sampling approach starts BFS from a subset of nodes, typically chosen at ran-
dom as in [16]. Despite its practicality, this approach has no obvious solution for
choosing the representative sample for BFS. An interesting approach has been pro-
posed by Cohen [18], but according to practitioner's experience [13] it appears not
to be as scalable as the ANF algorithm [42]. The latter is closely related to our work
since it is a sequential algorithm based on Flajolet-Martin sketches [23]. We review
its key idea in the next section.
8.2.4
D istinCt e lements in m ultisets
Let A = { a 1 ,…, a m } be a multiset where a i ∈ [n] for all i = 1,…, m. Let m i = |{ j : a j  = i }|.
For each k ≥ 0 define F
1 . The numbers F k are called frequency moments
of the multiset, and provide useful statistics. We notice that F 0 is the number of
distinct elements in A , F 1 = m . Historically, Morris was the first to show that F 1
can be approximated with O (log log m ) = O (log log n ) [40]. Flajolet and Martin
designed an algorithm that needs O (log n ) bits of memory to approximate F 0 [23].
Since PEGASUS implements Flajolet-Martin sketches, we briefly describe the main
idea of their algorithm. Let h : M → {0, 1,…, 2 L − 1} be a hash function that hashes
the elements of the multiset uniformly over the allowed range of values. Here, L has
to satisfy 2 L n . The algorithm maintains a bitmask, which is initially set to zero
and its i ith bit is set to one if there exists an element a j A whose hash value h ( a j )
ends in i consecutive zeros. Notice that the pattern 0 i of i trailing zeros occurs with
probability 2 i under the assumption of uniform hashing. Therefore, when i is signifi-
cantly larger than log 2 ( n ) we expect to observe zeros but when i ≈ log 2 ( n ) we expect
to observe both. Flajolet and Martin showed that the expected value of the leftmost
zero position R in the bitmask is E ( R ) = log 2 n ) where ϕ = 0:77351 and that the
variance is 1.12. To reduce the variance, typically, we use several hash functions and
use the average. For further improvements on this problem, the interested reader may
read [6,9-11,18,21,24,26,48]. Recently, Kane, Nelson, and Woodruff provided an
optimal algorithm for estimating F 0 [27].
Before we delve into the details of the proposed algorithm, we outline the high-
level goal of PEGASUS. A large number of important computational tasks can
be solved by iterating matrix-vector multiplication. For instance, computing the
diameter, radius distributions, PageRank scores, spectral clustering and connected
components fall into this class of computational problems. PEGASUS implements
optimized programming primitives to handle this class of problems.
Consider the following assignment v ′ ← M × v where M
n
k
=
m
k
i
i
=
mn
×
; v ∈ n . The i th
n
coordinate of v ′ is
v
′ =
m
vi
,
=
1
,
. Typically, in our applications, M is
,
m
i
i jj
,
j
=
1
Search WWH ::




Custom Search