Implementations of Map-Reduce
The original implementation of map-reduce was as an internal and proprietary system at Google. It was called simply "Map-Reduce." There is an open-source implementation called Hadoop. It can be downloaded, along with the HDFS distributed file system, from the Apache Foundation.
The output of the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents.
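As a concrete illustration of this Reduce step, here is a minimal sketch in plain Java, outside any real map-reduce framework; the class and method names (WordCountReduce, sumCounts) are our own illustrative choices, not part of the original example.

    import java.util.Arrays;
    import java.util.List;

    public class WordCountReduce {
        // Reduce for word count: given the list of values grouped under
        // a key w, return m, the total number of occurrences of w.
        static int sumCounts(List<Integer> values) {
            int m = 0;
            for (int v : values) {
                m += v;  // each value is a count contributed for the word w
            }
            return m;
        }

        public static void main(String[] args) {
            // Suppose grouping produced the pair ("the", [1, 1, 1, 1]).
            List<Integer> values = Arrays.asList(1, 1, 1, 1);
            System.out.println("(the, " + sumCounts(values) + ")");  // (the, 4)
        }
    }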
2.2.4 Combiners
It is common for the Reduce function to be associative and commutative. That
is, the values to be combined can be combined in any order, with the same
result. The addition performed in Example 2.2 is an example of an associative
and commutative operation. It doesn't matter how we group a list of numbers
v_1, v_2, . . . , v_n; the sum will be the same.
When the Reduce function is associative and commutative, it is possible to
push some of what Reduce does to the Map tasks. For example, instead of the
Map tasks in Example 2.1 producing many pairs (w, 1), (w, 1), . . . , we could
apply the Reduce function within the Map task, before the output of the Map
tasks is subject to grouping and aggregation. These key-value pairs would thus
be replaced by one pair with key w and value equal to the sum of all the 1's in
all those pairs. That is, the pairs with key w generated by a single Map task
would be combined into a pair (w, m), where m is the number of times that w
appears among the documents handled by this Map task. Note that it is still
necessary to do grouping and aggregation and to pass the result to the Reduce
tasks, since there will typically be one key-value pair with key w coming from
each of the Map tasks.
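To make this concrete, here is a sketch in plain Java of a Map task that applies the Reduce function (addition) within the task itself; the method name mapTaskWithCombiner and the in-memory HashMap are illustrative assumptions, and a real system such as Hadoop would instead let the user register the Reduce function as a separate combiner.

    import java.util.HashMap;
    import java.util.Map;

    public class CombinerSketch {
        // A Map task with a combiner: rather than emitting (w, 1) once per
        // occurrence, it sums the 1's locally and emits a single (w, m)
        // pair for each distinct word w seen by this task.
        static Map<String, Integer> mapTaskWithCombiner(String[] documents) {
            Map<String, Integer> localCounts = new HashMap<>();
            for (String doc : documents) {
                for (String w : doc.split("\\s+")) {
                    // Apply Reduce (addition) locally, within the Map task.
                    localCounts.merge(w, 1, Integer::sum);
                }
            }
            return localCounts;  // one (w, m) pair per distinct word
        }

        public static void main(String[] args) {
            System.out.println(mapTaskWithCombiner(
                new String[] {"a rose is a rose"}));  // {a=2, rose=2, is=1}
        }
    }

The pairs produced by different Map tasks must still be grouped and aggregated, exactly as the paragraph above observes.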
2.2.5 Details of Map-Reduce Execution
Let us now consider in more detail how a program using map-reduce is executed.
Figure 2.3 offers an outline of how processes, tasks, and files interact. Taking
advantage of a library provided by a map-reduce system such as Hadoop, the
user program forks a Master controller process and some number of Worker
processes at different compute nodes. Normally, a Worker handles either Map
tasks (a Map worker) or Reduce tasks (a Reduce worker), but not both.
The Master has many responsibilities. One is to create some number of
Map tasks and some number of Reduce tasks, these numbers being selected
by the user program. These tasks will be assigned to Worker processes by the
Master. It is reasonable to create one Map task for every chunk of the input file(s).
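To ground this description in a real system, the sketch below follows the widely published Hadoop WordCount example; the class names and the choice of four Reduce tasks are illustrative, and Hadoop itself creates one Map task per input split (chunk).

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map task: emit (w, 1) for each word in its chunk of the input.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce task: sum the counts for each word w, producing (w, m).
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);  // Reduce reused as combiner
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(4);  // the user program selects this number
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that the Reduce class doubles as the combiner of Section 2.2.4, which is legitimate precisely because addition is associative and commutative.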