by hashing the keys and associating each Reduce task with one of the buckets of the hash
function.
EXAMPLE 2.2 Let us continue with the word-count example of Example 2.1. The Reduce
function simply adds up all the values. The output of a reducer consists of the word and the
sum. Thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a
word that appears at least once among all the input documents and m is the total number of
occurrences of w among all those documents.
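The Reduce function of Example 2.2 can be sketched as follows; this is a minimal illustration, assuming each reducer receives a key together with the list of values grouped under it (here, the 1's emitted by the Map tasks). The function name is illustrative.

```python
# Sketch of the word-count Reduce function of Example 2.2: a reducer
# receives a word and the list of values grouped under that word, and
# emits the word with the sum of those values.
def reduce_word_count(word, counts):
    return (word, sum(counts))

# Hypothetical grouped input for one reducer: "mapreduce" appeared
# three times across the input documents.
pair = reduce_word_count("mapreduce", [1, 1, 1])
```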
2.2.4 Combiners
Sometimes, a Reduce function is associative and commutative. That is, the values to be
combined can be combined in any order, with the same result. The addition performed in
Example 2.2 is an example of an associative and commutative operation. It doesn't matter
how we group a list of numbers v1, v2, . . . , vn; the sum will be the same.
When the Reduce function is associative and commutative, we can push some of what
the reducers do to the Map tasks. For example, instead of the Map tasks in Example 2.1
producing many pairs (w, 1), (w, 1), . . . , we could apply the Reduce function within the
Map task, before the output of the Map tasks is subject to grouping and aggregation. These
key-value pairs would thus be replaced by one pair with key w and value equal to the sum
of all the 1's in all those pairs. That is, the pairs with key w generated by a single Map task
would be replaced by a pair (w, m), where m is the number of times that w appears among
the documents handled by this Map task. Note that it is still necessary to do grouping and
aggregation and to pass the result to the Reduce tasks, since there will typically be one key-
value pair with key w coming from each of the Map tasks.
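The combiner idea can be sketched as follows; this is a minimal illustration of applying the Reduce function inside a Map task, assuming one Map task handles one document and splits it on whitespace. The function name is illustrative, not from the text.

```python
from collections import Counter

# Sketch of a Map task with a combiner: instead of emitting m separate
# (w, 1) pairs for a word w, the task applies the (associative and
# commutative) addition locally and emits a single (w, m) pair per
# distinct word in its input.
def map_with_combiner(document):
    counts = Counter()
    for word in document.split():
        counts[word] += 1          # combine the (w, 1) pairs as they arise
    return list(counts.items())    # one (w, m) pair per distinct word
```

Grouping and aggregation across Map tasks is still required afterward, since each Map task may emit its own (w, m) pair for the same word w.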
2.2.5 Details of MapReduce Execution
Let us now consider in more detail how a program using MapReduce is executed. Figure
2.3 offers an outline of how processes, tasks, and files interact. Taking advantage of a
library provided by a MapReduce system such as Hadoop, the user program forks a Master
controller process and some number of Worker processes at different compute nodes.
Normally, a Worker handles either Map tasks (a Map worker) or Reduce tasks (a Reduce
worker), but not both.
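The Master/Worker structure can be sketched as follows; this is a minimal single-machine illustration using threads to stand in for the forked Worker processes (a real system such as Hadoop runs Workers on different compute nodes). All names are illustrative, and only Map workers are shown.

```python
import queue
import threading

# A Map worker repeatedly takes a document from the task queue and
# emits (w, 1) pairs for it; a None task signals "no more work".
def map_worker(tasks, out):
    for doc in iter(tasks.get, None):
        out.put([(w, 1) for w in doc.split()])

# The Master forks some number of Workers, assigns Map tasks to them,
# and collects their output.
def run_master(documents, num_workers=2):
    tasks, out = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=map_worker, args=(tasks, out))
               for _ in range(num_workers)]
    for w in workers:
        w.start()
    for doc in documents:
        tasks.put(doc)             # Master assigns one document per task
    for _ in workers:
        tasks.put(None)            # one shutdown signal per worker
    for w in workers:
        w.join()
    pairs = []
    while not out.empty():
        pairs.extend(out.get())
    return pairs
```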