Databases Reference
In-Depth Information
Example 2.9 : Suppose that b = c = 4, so k = 16. The sixteen Reduce tasks
can be thought of as arranged in a rectangle, as suggested by Fig. 2.8. There,
we see a hypothetical S-tuple S(v, w) for which h(v) = 2 and g(w) = 1. This
tuple is sent by its Map task only to the Reduce task (2, 1). We also see an
R-tuple R(u, v). Since h(v) = 2, this tuple is sent to all Reduce tasks (2, y), for
y = 1, 2, 3, 4. Finally, we see a T -tuple T (w, x). Since g(w) = 1, this tuple is
sent to all Reduce tasks (z, 1) for z = 1, 2, 3, 4. Notice that these three tuples
join, and they meet at exactly one Reduce task, the task numbered (2, 1).
2
Now, suppose that the sizes of R, S, and T are different; recall we use r,
s, and t, respectively, for those sizes. If we hash B-values to b buckets and
C-values to c buckets, where bc = k, then the total communication cost for
moving the tuples to the proper Reduce task is the sum of:
1. s to move each tuple S(v, w) once to the Reduce task
h(v), g(w)
.
h(v), y
2. cr to move each tuple R(u, v) to the c Reduce tasks
for each of
the c possible values of y.
z, g(w)
3. bt to move each tuple T (w, x) to the b Reduce tasks
for each of
the b possible values of z.
There is also a cost r + s + t to make each tuple of each relation be input to
one of the Map tasks. This cost is fixed, independent of b, c, and k.
We must select b and c, subject to the constraint bc = k, to minimize
s + cr + bt. We shall use the technique of Lagrangean multipliers to find the
place where the function s + cr + bt−λ(bc−k) has its derivatives with respect
to b and c equal to 0. That is, we must solve the equations r−λb = 0 and
t−λc = 0. Since r = λb and t = λc, we may multiply corresponding sides of
these equations to get rt = λ 2 bc. Since bc = k, we get rt = λ 2 k, or λ =
rt/k.
Thus, the minimum communication cost is obtained when c = t/λ =
kt/r,
and b = r/λ =
kr/t.
If we substitute these values into the formula s + cr + bt, we get s + 2
krt.
That is the communication cost for the Reduce tasks, to which we must add
the cost s + r + t for the communication cost of the Map tasks. The latter term
typically will be smaller than the first term by a factor O(
k), so we can neglect
it in most situations, as long as the number of Reduce tasks is reasonably large.
Example 2.10 : Let us see under what circumstances the 3-way join has lower
communication cost than the cascade of two 2-way joins. To make matters
simple, let us assume that R, S, and T are all the same relation R, which
represents the “friends” relation in a social network like Facebook. There are
roughly 300,000,000 subscribers on Facebook, with an average of 300 friends
each, so relation R has r = 9×10 10 tuples. Suppose we want to compute
R ⊲⊳ R ⊲⊳ R, perhaps as part of a calculation to find the number of friends
of friends of friends each subscriber has, or perhaps just the person with the
Search WWH ::




Custom Search