Implementations of Map-Reduce
The original implementation of map-reduce was as an internal and proprietary system at Google. It was called simply "Map-Reduce." There is an open-source implementation called Hadoop. It can be downloaded, along with the HDFS distributed file system, from the Apache Foundation.
The output of the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears at least once among all the input documents and m is the total number of occurrences of w among all those documents.
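As a concrete illustration of this Reduce step, here is a minimal sketch in plain Java, outside any real map-reduce framework; the class and method names (WordCountReduce, sumCounts) are our own illustrative choices, not part of the original example.

    import java.util.Arrays;
    import java.util.List;

    public class WordCountReduce {
        // Reduce for word count: given the list of values grouped under
        // a key w, return m, the total number of occurrences of w.
        static int sumCounts(List<Integer> values) {
            int m = 0;
            for (int v : values) {
                m += v;  // each value is a count contributed for the word w
            }
            return m;
        }

        public static void main(String[] args) {
            // Suppose grouping produced the pair ("the", [1, 1, 1, 1]).
            List<Integer> values = Arrays.asList(1, 1, 1, 1);
            System.out.println("(the, " + sumCounts(values) + ")");  // (the, 4)
        }
    }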
2.2.4 Combiners
It is common for the Reduce function to be associative and commutative. That
is, the values to be combined can be combined in any order, with the same
result. The addition performed in Example 2.2 is an example of an associative
and commutative operation. It doesn't matter how we group a list of numbers
v_1, v_2, . . . , v_n; the sum will be the same.
When the Reduce function is associative and commutative, it is possible to
push some of what Reduce does to the Map tasks. For example, instead of the
Map tasks in Example 2.1 producing many pairs (w, 1), (w, 1), . . . , we could
apply the Reduce function within the Map task, before the output of the Map
tasks is subject to grouping and aggregation. These key-value pairs would thus
be replaced by one pair with key w and value equal to the sum of all the 1's in
all those pairs. That is, the pairs with key w generated by a single Map task
would be combined into a pair (w, m), where m is the number of times that w
appears among the documents handled by this Map task. Note that it is still
necessary to do grouping and aggregation and to pass the result to the Reduce
tasks, since there will typically be one key-value pair with key w coming from
each of the Map tasks.
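To make this concrete, here is a sketch in plain Java of a Map task that applies the Reduce function (addition) within the task itself; the method name mapTaskWithCombiner and the in-memory HashMap are illustrative assumptions, and a real system such as Hadoop would instead let the user register the Reduce function as a separate combiner.

    import java.util.HashMap;
    import java.util.Map;

    public class CombinerSketch {
        // A Map task with a combiner: rather than emitting (w, 1) once per
        // occurrence, it sums the 1's locally and emits a single (w, m)
        // pair for each distinct word w seen by this task.
        static Map<String, Integer> mapTaskWithCombiner(String[] documents) {
            Map<String, Integer> localCounts = new HashMap<>();
            for (String doc : documents) {
                for (String w : doc.split("\\s+")) {
                    // Apply Reduce (addition) locally, within the Map task.
                    localCounts.merge(w, 1, Integer::sum);
                }
            }
            return localCounts;  // one (w, m) pair per distinct word
        }

        public static void main(String[] args) {
            System.out.println(mapTaskWithCombiner(
                new String[] {"a rose is a rose"}));  // {a=2, rose=2, is=1}
        }
    }

The pairs produced by different Map tasks must still be grouped and aggregated, exactly as the paragraph above observes.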
2.2.5 Details of Map-Reduce Execution
Let us now consider in more detail how a program using map-reduce is executed.
Figure 2.3 offers an outline of how processes, tasks, and files interact. Taking
advantage of a library provided by a map-reduce system such as Hadoop, the
user program forks a Master controller process and some number of Worker
processes at different compute nodes. Normally, a Worker handles either Map
tasks (a Map worker) or Reduce tasks (a Reduce worker), but not both.
The Master has many responsibilities. One is to create some number of
Map tasks and some number of Reduce tasks, these numbers being selected
by the user program. These tasks will be assigned to Worker processes by the
Master. It is reasonable to create one Map task for every chunk of the input file(s).
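To ground this description in a real system, the sketch below follows the widely published Hadoop WordCount example; the class names and the choice of four Reduce tasks are illustrative, and Hadoop itself creates one Map task per input split (chunk).

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map task: emit (w, 1) for each word in its chunk of the input.
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // Reduce task: sum the counts for each word w, producing (w, m).
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);  // Reduce reused as combiner
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            job.setNumReduceTasks(4);  // the user program selects this number
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Note that the Reduce class doubles as the combiner of Section 2.2.4, which is legitimate precisely because addition is associative and commutative.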