Large-Scale File Systems and Map-Reduce - Mining of Massive Datasets

Databases Reference

In-Depth Information

2.3.7 Computing Natural Join by Map-Reduce

The idea behind implementing natural join via map-reduce can be seen if we

look at the specific case of joining R(A, B) with S(B, C). We must find tuples

that agree on their B components, that is the second component from tuples

of R and the first component of tuples of S. We shall use the B-value of tuples

from either relation as the key. The value will be the other component and the

name of the relation, so the Reduce function can know where each tuple came

from.

The Map Function: For each tuple (a, b) of R, produce the key-value pair

.

The Reduce Function: Each key value b will be associated with a list of

pairs that are either of the form (R, a) or (S, c). Construct all pairs consisting

of one with first component R and the other with first component S, say (R, a)

and (S, c). The output for key b is (b, [(a 1 , b, c 1 ), (a 2 , b, c 2 ), . . . ]), that is, b

associated with the list of tuples that can be formed from an R-tuple and an

S-tuple with a common b value.

. For each tuple (b, c) of S, produce the key-value pair

b, (S, c)

b, (R, a)

There are a few observations we should make about this join algorithm.

First, the relation that is the result of the join is recovered by taking all the

tuples that appear on the lists for any key. Second, map-reduce implementations

such as Hadoop pass values to the Reduce tasks sorted by key. If so, then

identifying all the tuples from both relations that have key b is easy. If another

implementation were not to provide key-value pairs sorted by key, then the

Reduce function could still manage its task e ciently by hashing key-value

pairs locally by key. If enough buckets were used, most buckets would have

only one key. Finally, if there are n tuples of R with B-value b and m tuples

from S with B-value b, then there are mn tuples with middle component b in

the result. In the extreme case, all tuples from R and S have the same b-value,

and we are really taking a Cartesian product. However, it is quite common for

the number of tuples with shared B-values to be small, and in that case, the

time complexity of the Reduce function is closer to linear in the relation sizes

than to quadratic.

2.3.8 Generalizing the Join Algorithm

The same algorithm works if the relations have more than two attributes. You

can think of A as representing all those attributes in the schema of R but not

S. B represents the attributes in both schemas, and C represents attributes

only in the schema of S. The key for a tuple of R or S is the list of values in all

the attributes that are in the schemas of both R and S. The value for a tuple

of R is the name R and the values of all the attributes of R but not S, and the

value for a tuple of S is the name S and the values of the attributes of S but

not R.

The Reduce function looks at all the key-value pairs with a given key and

combines those values from R with those values of S in all possible ways. From

Mining of Massive Datasets

Search WWH ::

Custom Search

Home