Mining Social-Network Graphs - Mining of Massive Datasets

Database Reference

In-Depth Information

find the size of the reachable set for each node v , i.e., the set we have called N ( v, ∞). Recall

that for a graph of a billion nodes, it is totally infeasible to compute the neighborhoods for

each node, even using a very large cluster of machines. However, even if we want only a

count of the nodes in each neighborhood, we need to remember the nodes found so far as

we explore the graph, or else we shall not know whether a found node is new or one we

have seen already.

On the other hand, it is not so hard to find an approximation to the size of each neigh-

borhood, using the Flajolet-Martin technique discussed in Section 4.4.2 . Recall that this

technique uses some large number of hash functions; in this case, the hash functions are

applied to the nodes of the graph. The important property of the bit string we get when we

apply hash function h to node v is the “tail length” - the number of 0's at the end of the

string. For any set of nodes, an estimate of the size of the set is 2 R , where R is the length of

the longest tail for any member of the set. Thus, instead of storing all the members of the

set, we can instead record only the value of R for that set. Of course, there are many hash

functions, so we need to record values of R for each hash function.

EXAMPLE 10.31 If we use a hash function that produces a 64-bit string, then six bits are all

that are needed to store each value of R . For instance, if there are a billion nodes, and we

want to estimate the size of the neighborhood for each, we can store the value of R for 20

hash functions for each node in 15 gigabytes.

□

If we store tail lengths for each neighborhood, we can use this information to compute

estimates for the larger neighborhoods from our estimates for the smaller neighborhoods.

That is, suppose we have computed our estimates for | N ( v, d )| for all nodes v , and we want

to compute estimates for the neighborhoods of radius d + 1. For each hash function h , the

value of R for N ( v, d + 1) is the largest of:

(1) The tail of v itself or

(2) The values of R associated with h and N ( u, d ), where v → u is an arc of the graph.

Notice that it doesn't matter whether a node is reachable through only one successor of v

in the graph, or through many different successors. We get the same estimate in either case.

This useful property was the same one we exploited in Section 4.4.2 to avoid having to

know whether a stream element appeared once or many times in the stream.

We shall now describe the complete algorithm, called ANF (Approximate Neighborhood

Function). We choose K hash functions h 1 , h 2 , . . . , h k . For each node v and radius d , let

R i ( v, d ) denote the maximum tail length of any node in N ( v, d ) using hash function h i . To

initialize, let R i ( v, 0) be the length of the tail of h i ( v ), for all i and v .

Search WWH ::

Custom Search

Home