Mining Social-Network Graphs - Mining of Massive Datasets

Database Reference

In-Depth Information

of E , each with the same tuples, but with a different schemas. In SQL, this join would be

written using a single relation E ( A, B ) as follows:

SELECT e1.A, e1.B, e2.B

FROM E e1, E e2, E e3

WHERE e1.A = e2.A AND e1.B = e3.A AND e2.B = e3.B

In this query, the equated attributes e 1 .A and e 2 .A are represented in our join by the attribute

X . Also, e 1 .B and e 3 .A are each represented by Y ; e 2 .B and e 3 .B are represented by Z .

Notice that each triangle appears once in this join. The triangle consisting of nodes v 1 ,

v 2 , and v 3 is generated when X , Y , and Z are these three nodes in numerical order, i.e., X <

Y < Z . For instance, if the numerical order of the nodes is v 1 < v 2 < v 3 , then X can only be

v 1 , Y is v 2 , and Z is v 3 .

The technique of Section 2.5.3 can be used to optimize the join of Equation 10.2 . Recall

the ideas in Example 2.9 , where we considered the number of ways in which the values

of each attribute should be hashed. In the present example, the matter is quite simple. The

three occurrences of relation E surely have the same size, so by symmetry, attributes X , Y ,

and Z will each be hashed to the same number of buckets. In particular, if we hash nodes to

b buckets, then there will be b 3 reducers. Each Reduce task is associated with a sequence

of three bucket numbers ( x, y, z ), where each of x , y , and z is in the range 1 to b .

The Map tasks divide the relation E into as many parts as there are Map tasks. Suppose

one Map task is given the tuple E ( u, v ) to send to certain Reduce tasks. First, think of ( u,

v ) as a tuple of the join term E ( X, Y ). We can hash u and v to get the bucket numbers for X

and Y , but we don't know the bucket to which Z hashes. Thus, we must send E ( u, v ) to all

Reducer tasks that correspond to a sequence of three bucket numbers ( h ( u ) , h ( v ) , z ) for any

of the b possible buckets z .

But the same tuple E ( u, v ) must also be treated as a tuple for the term E ( X, Z ). We there-

fore also send the tuple E ( u, v ) to all Reduce tasks that correspond to a triple ( h ( u ) , y, h ( v ))

for any y . Finally, we treat E ( u, v ) as a tuple of the term E ( Y, Z ) and send that tuple to all

Reduce tasks corresponding to a triple ( x, h ( u ) , h ( v )) for any x . The total communication

required is thus 3 b key-value pairs for each of the m tuples of the edge relation E . That is,

the minimum communication cost is O ( mb ) if we use b 3 Reduce tasks.

Next, let us compute the total execution cost at all the Reduce tasks. Assume that the

hash function distributes edges sufficiently randomly that the Reduce tasks each get ap-

proximately the same number of edges. Since the total number of edges distributed to

the b 3 Reduce tasks is O ( mb ), it follows that each task receives O ( m/b 2 ) edges. If we use

the algorithm of Section 10.7.2 at each Reduce task, the total computation at a task is

Search WWH ::

Custom Search

Home