Database Reference
In-Depth Information
(2) The key-value pairs from each Map task are collected by a master controller and sor-
ted by key. The keys are divided among all the Reduce tasks, so all key-value pairs
with the same key wind up at the same Reduce task.
(3) The Reduce tasks work on one key at a time, and combine all the values associated
with that key in some way. The manner of combination of values is determined by the
code written by the user for the Reduce function.
Figure 2.2 suggests this computation.
Figure 2.2 Schematic of a MapReduce computation
2.2.1
The Map Tasks
We view input files for a Map task as consisting of elements , which can be any type: a tuple
or a document, for example. A chunk is a collection of elements, and no element is stored
across two chunks. Technically, all inputs to Map tasks and outputs from Reduce tasks are
of the key-value-pair form, but normally the keys of input elements are not relevant and we
shall tend to ignore them. Insisting on this form for inputs and outputs is motivated by the
desire to allow composition of several MapReduce processes.
The Map function takes an input element as its argument and produces zero or more key-
value pairs. The types of keys and values are each arbitrary. Further, keys are not “keys”
in the usual sense; they do not have to be unique. Rather a Map task can produce several
key-value pairs with the same key, even from the same element.
EXAMPLE 2.1 We shall illustrate a MapReduce computation with what has become the
standard example application: counting the number of occurrences for each word in a col-
lection of documents. In this example, the input file is a repository of documents, and each
document is an element. The Map function for this example uses keys that are of type
String (the words) and values that are integers. The Map task reads a document and breaks
it into its sequence of words w 1 , w 2 , . . . , w n . It then emits a sequence of key-value pairs
where the value is always 1. That is, the output of the Map task for this document is the
sequence of key-value pairs:
Search WWH ::




Custom Search