MapReduce and the New Software Stack - Mining of Massive Datasets

Database Reference

In-Depth Information

itself, but we first need to imagine that it is two relations, with different schemas, so we

can describe the desired connection as a natural join. Thus, imagine that there are two cop-

ies of Links , namely L 1( U 1 , U 2) and L 2( U 2 , U 3). Now, if we compute L 1 ⋈ L 2, we shall

have exactly what we want. That is, for each tuple t 1 of L 1 (i.e., each tuple of Links ) and

each tuple t 2 of L 2 (another tuple of Links , possibly even the same tuple), see if their U 2

components are the same. Note that these components are the second component of t 1 and

the first component of t 2. If these two components agree, then produce a tuple for the res-

ult, with schema ( U 1 , U 2 , U 3). This tuple consists of the first component of t 1, the second

component of t 1 (which must equal the first component of t 2), and the second component

of t 2.

We may not want the entire path of length two, but only want the pairs ( u, w ) of URL's

such that there is at least one path from u to w of length two. If so, we can project out the

middle components by computing π U 1 , U 3 ( L 1 ⋈ L 2).

□

EXAMPLE 2.5 Imagine that a social-networking site has a relation

Friends(User, Friend)

This relation has tuples that are pairs ( a, b ) such that b is a friend of a . The site might want

to develop statistics about the number of friends members have. Their first step would be

to compute a count of the number of friends of each user. This operation can be done by

grouping and aggregation, specifically

γ U ser , COUNT ( F riend) (Friends)

This operation groups all the tuples by the value in their first component, so there is one

group for each user. Then, for each group the count of the number of friends of that user

is made. The result will be one tuple for each group, and a typical tuple would look like

(Sally , 300), if user “Sally” has 300 friends.

□

2.3.4

Computing Selections by MapReduce

Selections really do not need the full power of MapReduce. They can be done most con-

veniently in the map portion alone, although they could also be done in the reduce portion

alone. Here is a MapReduce implementation of selection σ C ( R ).

x For each tuple t in R , test if it satisfies C . If so, produce the key-value pair ( t, t ). That

is, both the key and value are t .

The Reduce Function The Reduce function is the identity. It simply passes each key-value

pair to the output.

Search WWH ::

Custom Search

Home