Database Reference
In-Depth Information
Figure 2.8
Sixteen reducers together perform a 3-way join
Computation Cost of the 3-Way Join
Each of the reducers must join of parts of the three relations, and it is reasonable to ask whether this join can be taken
in time that is linear in the size of the input to that Reduce task. While more complex joins might not be computable
in linear time, the join of our running example can be executed at each Reduce process efficiently. First, create an
index on
R.B
, to organize the
R
-tuples received. Likewise, create an index on
T.C
for the
T
-tuples. Then, consider each
received
S
-tuple,
S
(
v, w
). Use the index on
R.B
to find all
R
-tuples with
R.B
=
v
and use the index on
T.C
to find all
T
-tuples with
T.C
=
w
.
Now, suppose that the sizes of
R
,
S
, and
T
are different; recall we use
r
,
s
, and
t
, respect-
ively, for those sizes. If we hash
B
-values to
b
buckets and
C
-values to
c
buckets, where
bc
=
k
, then the total communication cost for moving the tuples to the proper reducers is the
sum of:
(1)
s
to move each tuple
S
(
v, w
) once to the reducer (
h
(
v
)
, g
(
w
)).
(2)
cr
to move each tuple
R
(
u, v
) to the
c
reducers (
h
(
v
)
, y
) for each of the
c
possible values
of
y
.
(3)
bt
to move each tuple
T
(
w, x
) to the
b
reducers (
z, g
(
w
)) for each of the
b
possible val-
ues of
z
.
There is also a cost
r
+
s
+
t
to make each tuple of each relation be input to one of the Map
tasks. This cost is fixed, independent of
b
,
c
, and
k
.
We must select
b
and
c
, subject to the constraint
bc
=
k
, to minimize
s
+
cr
+
bt
. We shall
use the technique of Lagrangean multipliers to find the place where the function
s
+
cr
+
bt
−
λ
(
bc
−
k
) has its derivatives with respect to
b
and
c
equal to 0. That is, we must solve the
equations
r
−
λb
= 0 and
t
−
λc
= 0. Since
r
=
λb
and
t
=
λc
, we may multiply corresponding
sides of these equations to get
rt
=
λ
2
bc
. Since
bc
=
k
, we get
rt
=
λ
2
k
, or
Thus, the
minimum communication cost is obtained when and
If we substitute these values into the formula
s
+
cr
+
bt
, we get That is the communic-
ation cost for the Reduce tasks, to which we must add the cost
s
+
r
+
t
for the communic-
ation cost of the Map tasks. The total communication cost is thus
In most circum-
stances, we can neglect
r
+
t
, because it will be less than
usually by a factor of