Databases Reference
In-Depth Information
2.3.7 Computing Natural Join by Map-Reduce
The idea behind implementing natural join via map-reduce can be seen if we
look at the specific case of joining R(A, B) with S(B, C). We must find tuples
that agree on their B components, that is the second component from tuples
of R and the first component of tuples of S. We shall use the B-value of tuples
from either relation as the key. The value will be the other component and the
name of the relation, so the Reduce function can know where each tuple came
from.
The Map Function: For each tuple (a, b) of R, produce the key-value pair
.
The Reduce Function: Each key value b will be associated with a list of
pairs that are either of the form (R, a) or (S, c). Construct all pairs consisting
of one with first component R and the other with first component S, say (R, a)
and (S, c). The output for key b is (b, [(a 1 , b, c 1 ), (a 2 , b, c 2 ), . . . ]), that is, b
associated with the list of tuples that can be formed from an R-tuple and an
S-tuple with a common b value.
. For each tuple (b, c) of S, produce the key-value pair
b, (S, c)
b, (R, a)
There are a few observations we should make about this join algorithm.
First, the relation that is the result of the join is recovered by taking all the
tuples that appear on the lists for any key. Second, map-reduce implementations
such as Hadoop pass values to the Reduce tasks sorted by key. If so, then
identifying all the tuples from both relations that have key b is easy. If another
implementation were not to provide key-value pairs sorted by key, then the
Reduce function could still manage its task e ciently by hashing key-value
pairs locally by key. If enough buckets were used, most buckets would have
only one key. Finally, if there are n tuples of R with B-value b and m tuples
from S with B-value b, then there are mn tuples with middle component b in
the result. In the extreme case, all tuples from R and S have the same b-value,
and we are really taking a Cartesian product. However, it is quite common for
the number of tuples with shared B-values to be small, and in that case, the
time complexity of the Reduce function is closer to linear in the relation sizes
than to quadratic.
2.3.8 Generalizing the Join Algorithm
The same algorithm works if the relations have more than two attributes. You
can think of A as representing all those attributes in the schema of R but not
S. B represents the attributes in both schemas, and C represents attributes
only in the schema of S. The key for a tuple of R or S is the list of values in all
the attributes that are in the schemas of both R and S. The value for a tuple
of R is the name R and the values of all the attributes of R but not S, and the
value for a tuple of S is the name S and the values of the attributes of S but
not R.
The Reduce function looks at all the key-value pairs with a given key and
combines those values from R with those values of S in all possible ways. From
Search WWH ::




Custom Search