Database Reference
In-Depth Information
These two concepts relate to the notions of neighborhoods that we have seen earlier. In
fact,
Path
(
u, v
) is true if and only if
v
is in
N
(
u,
∞), which we define to be
∪
i≥
0
N
(
u, i
).
Thus, the reachability problem is to compute the union of all the neighborhoods of any
radius for a given node
u
. The discussion in
Section 10.8.2
reminds us that we can com-
pute the reachable set for
u
by computing its neighborhoods up to that smallest radius
d
for
which
N
(
u, d
) =
N
(
u, d
+ 1).
The two problems - transitive closure and reachability - are related, but there are many
examples of graphs where reachability is feasible and transitive closure is not. For instance,
suppose we have a Web graph of a billion nodes. If we want to find the pages (nodes)
reachable from a given page, we can do so, even on a single machine with a large main
memory. However, just to produce the transitive closure of the graph could involve 10
18
pairs of nodes, which is not practical, even using a large cluster of computers.
5
10.8.4
Transitive Closure Via MapReduce
When it comes to parallel implementation, transitive closure is actually more readily par-
allelizable than is reachability. If we want to compute
N
(
v,
∞), the set of nodes reachable
from node
v
, without computing the entire transitive closure, we have no option but to com-
pute the sequence of neighborhoods, which is essentially a breadth-first search of the graph
from
v
. In relational terms, suppose we have a relation
Arc
(
X, Y
) containing those pairs (
x,
y
) such that there is an arc
x
→
y
. We want to compute iteratively a relation
Reach
(
X
) that
is the set of nodes reachable from node
v
. After
i
rounds,
Reach
(
X
) will contain all those
nodes in
N
(
v, i
).
Initially,
Reach
(
X
) contains only the node
v
. Suppose it contains all the nodes in
N
(
v, i
)
after some round of MapReduce. To construct
N
(
v, i
+ 1) we need to join
Reach
with the
Arc
relation, then project onto the second component and perform a union of the result with
the old value of
Reach
. In SQL terms, we perform
SELECT DISTINCT Arc.Y
FROM Reach, Arc
WHERE Arc.X = Reach.X;
This query asks us to compute the natural join of
Reach
(
X
) and
Arc
(
X, Y
), which we can
do by MapReduce as explained in
Section 2.3.7
.
Then we have to group the result by
Y
and
eliminate duplicates, a process that can be done by another MapReduce job as in
Section
How many rounds this process requires depends on how far from
v
is the furthest node
that
v
can reach. In many social-network graphs, the diameter is small, as discussed in