Mining Social-Network Graphs - Mining of Massive Datasets

Database Reference

In-Depth Information

the box on “Six Degrees of Separation.” If so, computing reachability in parallel, using

MapReduce or another approach, is feasible. Few rounds of computation will be needed

and the space requirements are not greater than the space it takes to represent the graph.

However, there are some graphs where the number of rounds is a serious impediment.

For instance, in a typical portion of the Web, it has been found that most pages reachable

from a given page are reachable by paths of length 10-15. Yet there are some pairs of pages

such that the first reaches the second, but only through paths whose length is measured in

the hundreds. For instance, blogs are sometimes structured so each response is reachable

only through the comment to which it responds. Running arguments lead to long paths with

no way to “shortcut” around that path. Or a tutorial on the Web, with 50 chapters, may be

structured so you can only get to Chapter i through the page for Chapter i −1.

Interestingly, the transitive closure can be computed much faster in parallel than can

strict reachability. By a recursive-doubling technique, we can double the length of paths

we know about in a single round. Thus, on a graph of diameter d , we need only log 2 d

rounds, rather than d rounds. If d = 6, the difference is not important, but if d = 1000, log 2

d is about 10, so there is a hundredfold reduction in the number of rounds. The problem, as

discussed above, is that while we can compute the transitive closure quickly, we must com-

pute many more facts than are needed for a reachability computation on the same graph,

and therefore the space requirements for transitive closure can greatly exceed the space re-

quirements for reachability. That is, if all we want is the set Reach ( v ), we can compute the

transitive closure of the entire graph, and then throw away all pairs that do not have v as

their first component. But we cannot throw away all those pairs until we are done. During

the computation of the transitive closure, we could wind up computing many facts Path ( x,

y ), where neither x nor y is reachable from v , and even if they are reachable from v , we may

not need to know x can reach y .

Assuming the graph is small enough that we can compute the transitive closure in its

entirety, we still must be careful how we do so using MapReduce or another parallelism

approach. The simplest recursive-doubling approach is to start the the relation Path ( X, Y )

equal to the arc relation Arc ( X, Y ). Suppose that after i rounds, Path ( X, Y ) contains all pairs

( x, y ) such that there is a path from x to y of length at most 2 i . Then if we join Path with

itself at the next round, we shall discover all those pairs ( x, y ) such that there is a path from

x to y of length at most 2 × 2 i = 2 i +1 . The recursive-doubling query in SQL is

SELECT DISTINCT p1.X, p2.Y

FROM Path p1, Path p2

WHERE p1.Y = p2.X;

After computing this query, we get all pairs connected by a path of length between 2 and

2 i +1 , assuming Path contains pairs connected by paths of length between 1 and 2 i . If we

Search WWH ::

Custom Search

Home