Database Reference
In-Depth Information
O (( m/b 2 ) 3/2 ), or O ( m 3/2 /b 3 ). Since there are b 3 Reduce tasks, the total computation cost is
O ( m 3/2 ), exactly as for the one-processor algorithm of Section 10.7.2 .
10.7.5
Using Fewer Reduce Tasks
By a judicious ordering of the nodes, we can lower the number of Reduce tasks by approx-
imately a factor of 6. Think of the “name” of the node i as the pair ( h ( i ) , i ), where h is the
hash function that we used in Section 10.7.4 to hash nodes to b buckets. Order nodes by
their name, considering only the first component (i.e., the bucket to which the node hashes),
and only using the second component to break ties among nodes that are in the same buck-
et.
If we use this ordering of nodes, then the Reduce task corresponding to the list of buckets
( i, j, k ) will be needed only if i j k . If b is large, then approximately 1/6 of all b 3 se-
quences of integers, each in the range 1 to b , will satisfy these inequalities. For any b , the
number of such sequences is
(see Exercise 10.7.4 ) . Thus, the exact ratio is ( b + 2)( b
+ 1)/(6 b 2 ).
As there are fewer reducers, we get a substantial decrease in the number of key-value
pairs that must be communicated. Instead of having to send each of the m edges to 3 b Re-
duce tasks, we need to send each edge only to b tasks. Specifically, consider an edge e
whose two nodes hash to i and j ; these buckets could be the same or different. For each of
the b values of k between 1 and b , consider the list formed from i , j , and k in sorted order.
Then the Reduce task that corresponds to this list requires the edge e . But no other Reduce
tasks require e .
To compare the communication cost of the method of this section with that of Section
10.7.4 , let us fix the number of Reduce tasks, say k . Then the method of Section 10.7.4
hashes nodes to buckets, and therefore communicates key-value pairs. On the other
hand, the method of this section hashes nodes to approximately buckets, thus requiring
communications. Thus, the ratio of the communication needed by the method of Sec-
tion 10.7.4 to what is needed here is
EXAMPLE 10.25 Consider the straightforward algorithm of Section 10.7.4 with b = 6. That
is, there are b 3 = 216 Reduce tasks and the communication cost is 3 mb = 18 m . We cannot
use exactly 216 Reduce tasks with the method of this section, but we can come very close
if we choose b = 10. Then, the number of Reduce tasks is and the communication cost
is mb = 10 m . That is, the communication cost is 5/9th of the cost of the straightforward
method.
Search WWH ::




Custom Search