Large-Scale File Systems and Map-Reduce - Mining of Massive Datasets

Databases Reference

In-Depth Information

each pairing, the tuple produced has the values from R, the key values, and the

values from S.

2.3.9 Grouping and Aggregation by Map-Reduce

As with the join, we shall discuss the minimal example of grouping and aggrega-

tion, where there is one grouping attribute and one aggregation. Let R(A, B, C)

be a relation to which we apply the operator γ A,θ(B) (R). Map will perform the

grouping, while Reduce does the aggregation.

The Map Function: For each tuple (a, b, c) produce the key-value pair (a, b).

The Reduce Function: Each key a represents a group. Apply the aggregation

operator θ to the list [b 1 , b 2 , . . . , b n ] of B-values associated with key a. The

output is the pair (a, x), where x is the result of applying θ to the list. For

example, if θ is SUM, then x = b 1 + b 2 ++ b n , and if θ is MAX, then x is

the largest of b 1 , b 2 , . . . , b n .

If there are several grouping attributes, then the key is the list of the values

of a tuple for all these attributes. If there is more than one aggregation, then

the Reduce function applies each of them to the list of values associated with

a given key and produces a tuple consisting of the key, including components

for all grouping attributes if there is more than one, followed by the results of

each of the aggregations.

2.3.10 Matrix Multiplication

If M is a matrix with element m ij in row i and column j, and N is a matrix

with element n jk in row j and column k, then the product P = M N is the

matrix P with element p ik in row i and column k, where

p ik =

m ij n jk

j

It is required that the number of columns of M equals the number of rows of

N , so the sum over j makes sense.

We can think of a matrix as a relation with three attributes: the row number,

the column number, and the value in that row and column. Thus, we could

view matrix M as a relation M (I, J, V ), with tuples (i, j, m ij ) and we could view

matrix N as a relation N (J, K, W ), with tuples (j, k, n jk ). As large matrices are

often sparse (mostly 0's), and since we can omit the tuples for matrix elements

that are 0, this relational representation is often a very good one for a large

matrix. However, it is possible that i, j, and k are implicit in the position of a

matrix element in the file that represents it, rather than written explicitly with

the element itself. In that case, the Map function will have to be designed to

construct the I, J , and K components of tuples from the position of the data.

The product M N is almost a natural join followed by grouping and ag-

gregation.

That is, the natural join of M (I, J, V ) and N (J, K, W ), having

Search WWH ::

Custom Search

Home