Database Reference
In-Depth Information
find the size of the reachable set for each node v , i.e., the set we have called N ( v, ∞). Recall
that for a graph of a billion nodes, it is totally infeasible to compute the neighborhoods for
each node, even using a very large cluster of machines. However, even if we want only a
count of the nodes in each neighborhood, we need to remember the nodes found so far as
we explore the graph, or else we shall not know whether a found node is new or one we
have seen already.
On the other hand, it is not so hard to find an approximation to the size of each neigh-
borhood, using the Flajolet-Martin technique discussed in Section 4.4.2 . Recall that this
technique uses some large number of hash functions; in this case, the hash functions are
applied to the nodes of the graph. The important property of the bit string we get when we
apply hash function h to node v is the “tail length” - the number of 0's at the end of the
string. For any set of nodes, an estimate of the size of the set is 2 R , where R is the length of
the longest tail for any member of the set. Thus, instead of storing all the members of the
set, we can instead record only the value of R for that set. Of course, there are many hash
functions, so we need to record values of R for each hash function.
EXAMPLE 10.31 If we use a hash function that produces a 64-bit string, then six bits are all
that are needed to store each value of R . For instance, if there are a billion nodes, and we
want to estimate the size of the neighborhood for each, we can store the value of R for 20
hash functions for each node in 15 gigabytes.
If we store tail lengths for each neighborhood, we can use this information to compute
estimates for the larger neighborhoods from our estimates for the smaller neighborhoods.
That is, suppose we have computed our estimates for | N ( v, d )| for all nodes v , and we want
to compute estimates for the neighborhoods of radius d + 1. For each hash function h , the
value of R for N ( v, d + 1) is the largest of:
(1) The tail of v itself or
(2) The values of R associated with h and N ( u, d ), where v u is an arc of the graph.
Notice that it doesn't matter whether a node is reachable through only one successor of v
in the graph, or through many different successors. We get the same estimate in either case.
This useful property was the same one we exploited in Section 4.4.2 to avoid having to
know whether a stream element appeared once or many times in the stream.
We shall now describe the complete algorithm, called ANF (Approximate Neighborhood
Function). We choose K hash functions h 1 , h 2 , . . . , h k . For each node v and radius d , let
R i ( v, d ) denote the maximum tail length of any node in N ( v, d ) using hash function h i . To
initialize, let R i ( v, 0) be the length of the tail of h i ( v ), for all i and v .
Search WWH ::




Custom Search