by hashing the keys and associating each Reduce task with one of the buckets of the hash
function.
EXAMPLE 2.2 Let us continue with the word-count example of Example 2.1. The Reduce
function simply adds up all the values. The output of a reducer consists of the word and the
sum. Thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a
word that appears at least once among all the input documents and m is the total number of
occurrences of w among all those documents.
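The Reduce function of Example 2.2 can be sketched as follows; this is a minimal illustration, assuming each reducer receives a key together with the list of values grouped under it (here, the 1's emitted by the Map tasks). The function name is illustrative.

```python
# Sketch of the word-count Reduce function of Example 2.2: a reducer
# receives a word and the list of values grouped under that word, and
# emits the word with the sum of those values.
def reduce_word_count(word, counts):
    return (word, sum(counts))

# Hypothetical grouped input for one reducer: "mapreduce" appeared
# three times across the input documents.
pair = reduce_word_count("mapreduce", [1, 1, 1])
```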
2.2.4 Combiners
Sometimes, a Reduce function is associative and commutative. That is, the values to be
combined can be combined in any order, with the same result. The addition performed in
Example 2.2 is an example of an associative and commutative operation. It doesn't matter
how we group a list of numbers v1, v2, . . . , vn; the sum will be the same.
When the Reduce function is associative and commutative, we can push some of what
the reducers do to the Map tasks. For example, instead of the Map tasks in Example 2.1
producing many pairs (w, 1), (w, 1), . . . , we could apply the Reduce function within the
Map task, before the output of the Map tasks is subject to grouping and aggregation. These
key-value pairs would thus be replaced by one pair with key w and value equal to the sum
of all the 1's in all those pairs. That is, the pairs with key w generated by a single Map task
would be replaced by a pair (w, m), where m is the number of times that w appears among
the documents handled by this Map task. Note that it is still necessary to do grouping and
aggregation and to pass the result to the Reduce tasks, since there will typically be one key-
value pair with key w coming from each of the Map tasks.
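The combiner idea can be sketched as follows; this is a minimal illustration of applying the Reduce function inside a Map task, assuming one Map task handles one document and splits it on whitespace. The function name is illustrative, not from the text.

```python
from collections import Counter

# Sketch of a Map task with a combiner: instead of emitting m separate
# (w, 1) pairs for a word w, the task applies the (associative and
# commutative) addition locally and emits a single (w, m) pair per
# distinct word in its input.
def map_with_combiner(document):
    counts = Counter()
    for word in document.split():
        counts[word] += 1          # combine the (w, 1) pairs as they arise
    return list(counts.items())    # one (w, m) pair per distinct word
```

Grouping and aggregation across Map tasks is still required afterward, since each Map task may emit its own (w, m) pair for the same word w.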
2.2.5 Details of MapReduce Execution
Let us now consider in more detail how a program using MapReduce is executed. Figure
2.3 offers an outline of how processes, tasks, and files interact. Taking advantage of a
library provided by a MapReduce system such as Hadoop, the user program forks a Master
controller process and some number of Worker processes at different compute nodes.
Normally, a Worker handles either Map tasks (a Map worker) or Reduce tasks (a Reduce
worker), but not both.
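The Master/Worker structure can be sketched as follows; this is a minimal single-machine illustration using threads to stand in for the forked Worker processes (a real system such as Hadoop runs Workers on different compute nodes). All names are illustrative, and only Map workers are shown.

```python
import queue
import threading

# A Map worker repeatedly takes a document from the task queue and
# emits (w, 1) pairs for it; a None task signals "no more work".
def map_worker(tasks, out):
    for doc in iter(tasks.get, None):
        out.put([(w, 1) for w in doc.split()])

# The Master forks some number of Workers, assigns Map tasks to them,
# and collects their output.
def run_master(documents, num_workers=2):
    tasks, out = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=map_worker, args=(tasks, out))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for doc in documents:
        tasks.put(doc)             # Master assigns one document per task
    for _ in workers:
        tasks.put(None)            # one shutdown signal per worker
    for w in workers:
        w.join()
    pairs = []
    while not out.empty():
        pairs.extend(out.get())
    return pairs
```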