Database Reference
In-Depth Information
Here, we shall look only at the join
R
(
A, B
)
⋈
S
(
B, C
)
⋈
T
(
C, D
) as an example. Suppose
that the relations
R
,
S
, and
T
have sizes
r
,
s
, and
t
, respectively, and for simplicity, suppose
that
p
is the probability that:
(1) an
R
-tuple and and
S
-tuple agree on
B
; and also the probability that
(2) an
S
-tuple and a
T
-tuple agree on
C
.
If we join
R
and
S
first, using the MapReduce algorithm of
Section 2.3.7
, then the com-
munication cost is
O
(
r
+
s
), and the size of the intermediate join
R
⋈
S
is
prs
. When we join
this result with
T
, the communication of this second MapReduce job is
O
(
t
+
prs
). Thus,
the entire communication cost of the algorithm consisting of two 2-way joins is
O
(
r
+
s
+
t
+
prs
). If we instead join
S
and
T
first, and then join
R
with the result, we get another
algorithm whose communication cost is
O
(
r
+
s
+
t
+
pst
).
A third way to take this join is to use a single MapReduce job that joins the three re-
lations at once. Suppose that we plan to use
k
reducers for this job. Pick numbers
b
and
c
representing the number of buckets into which we shall hash
B
- and
C
-values, respect-
ively. Let
h
be a hash function that sends
B
-values into
b
buckets, and let
g
be another hash
function that sends
C
-values into
c
buckets. We require that
bc
=
k
; that is, each reducer
corresponds to a pair of buckets, one for the
B
-value and one for the
C
-value. The reducer
corresponding to bucket pair (
i, j
) is responsible for joining the tuples
R
(
u, v
),
S
(
v, w
), and
T
(
w, x
) whenever
h
(
v
) =
i
and
g
(
w
) =
j
.
As a result, the Map tasks that send tuples of
R
,
S
, and
T
to the reducers that need them
must send
R
- and
T
-tuples to more than one reducer. For an
S
-tuple
S
(
v, w
), we know the
B
- and
C
-values, so we can send this tuple only to the reducer for (
h
(
v
)
, g
(
w
)). However,
consider an
R
-tuple
R
(
u, v
). We know it only needs to go to reducers that correspond to
(
h
(
v
)
, y
), for some
y
. But we don't know
y
; the value of
C
could be anything as far as we
know. Thus, we must send
R
(
u, v
) to
c
reducers, since
y
could be any of the
c
buckets for
C
-values. Similarly, we must send the
T
-tuple
T
(
w, x
) to each of the reducers (
z, g
(
w
)) for
any
z
. There are
b
such reducers.
EXAMPLE
2.9 Suppose that
b
=
c
= 4, so
k
= 16. The sixteen reducers can be thought of as
arranged in a rectangle, as suggested by
Fig. 2.8
. There, we see a hypothetical
S
-tuple
S
(
v,
w
) for which
h
(
v
) = 2 and
g
(
w
) = 1. This tuple is sent by its Map task only to the reducer for
key (2
,
1). We also see an
R
-tuple
R
(
u, v
). Since
h
(
v
) = 2, this tuple is sent to all reducers
(2
, y
), for
y
= 1
,
2
,
3
,
4. Finally, we see a
T
-tuple
T
(
w, x
). Since
g
(
w
) = 1, this tuple is sent
to all reducers (
z,
1) for
z
= 1
,
2
,
3
,
4. Notice that these three tuples join, and they meet at
exactly one reducer, the reducer for key (2
,
1).
□