Database Reference
In-Depth Information
of E , each with the same tuples, but with a different schemas. In SQL, this join would be
written using a single relation E ( A, B ) as follows:
SELECT e1.A, e1.B, e2.B
FROM E e1, E e2, E e3
WHERE e1.A = e2.A AND e1.B = e3.A AND e2.B = e3.B
In this query, the equated attributes e 1 .A and e 2 .A are represented in our join by the attribute
X . Also, e 1 .B and e 3 .A are each represented by Y ; e 2 .B and e 3 .B are represented by Z .
Notice that each triangle appears once in this join. The triangle consisting of nodes v 1 ,
v 2 , and v 3 is generated when X , Y , and Z are these three nodes in numerical order, i.e., X <
Y < Z . For instance, if the numerical order of the nodes is v 1 < v 2 < v 3 , then X can only be
v 1 , Y is v 2 , and Z is v 3 .
The technique of Section 2.5.3 can be used to optimize the join of Equation 10.2 . Recall
the ideas in Example 2.9 , where we considered the number of ways in which the values
of each attribute should be hashed. In the present example, the matter is quite simple. The
three occurrences of relation E surely have the same size, so by symmetry, attributes X , Y ,
and Z will each be hashed to the same number of buckets. In particular, if we hash nodes to
b buckets, then there will be b 3 reducers. Each Reduce task is associated with a sequence
of three bucket numbers ( x, y, z ), where each of x , y , and z is in the range 1 to b .
The Map tasks divide the relation E into as many parts as there are Map tasks. Suppose
one Map task is given the tuple E ( u, v ) to send to certain Reduce tasks. First, think of ( u,
v ) as a tuple of the join term E ( X, Y ). We can hash u and v to get the bucket numbers for X
and Y , but we don't know the bucket to which Z hashes. Thus, we must send E ( u, v ) to all
Reducer tasks that correspond to a sequence of three bucket numbers ( h ( u ) , h ( v ) , z ) for any
of the b possible buckets z .
But the same tuple E ( u, v ) must also be treated as a tuple for the term E ( X, Z ). We there-
fore also send the tuple E ( u, v ) to all Reduce tasks that correspond to a triple ( h ( u ) , y, h ( v ))
for any y . Finally, we treat E ( u, v ) as a tuple of the term E ( Y, Z ) and send that tuple to all
Reduce tasks corresponding to a triple ( x, h ( u ) , h ( v )) for any x . The total communication
required is thus 3 b key-value pairs for each of the m tuples of the edge relation E . That is,
the minimum communication cost is O ( mb ) if we use b 3 Reduce tasks.
Next, let us compute the total execution cost at all the Reduce tasks. Assume that the
hash function distributes edges sufficiently randomly that the Reduce tasks each get ap-
proximately the same number of edges. Since the total number of edges distributed to
the b 3 Reduce tasks is O ( mb ), it follows that each task receives O ( m/b 2 ) edges. If we use
the algorithm of Section 10.7.2 at each Reduce task, the total computation at a task is
Search WWH ::




Custom Search