Figure 10.1 illustrates the MapReduce processing for a single input, in this case a line of text.
Figure 10.1. Example of how MapReduce works
In this example, the map step parses the provided text string into individual words and emits a set of key/value pairs of the form <word, 1>. For each unique key (in this example, word), the reduce step sums the 1 values and outputs the <word, count> key/value pairs. Because the word each appeared twice in the given line of text, the reduce step provides a corresponding key/value pair of <each, 2>.
Note that in this example the original key, 1234, is ignored in the processing. In a typical word count application, the map step may be applied to millions of lines of text, and the reduce step summarizes the key/value pairs generated by all the map steps.
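The map and reduce steps described above can be sketched in a few lines of Python. This is a minimal single-process illustration, not a distributed implementation. The function names map_step, shuffle, reduce_step, and word_count are hypothetical, and the sample line of text is invented for the sketch because the text in Figure 10.1 is not reproduced here. Only the input key 1234 comes from the example.

```python
from collections import defaultdict

def map_step(key, line):
    """Emit a <word, 1> pair for each word; the input key is ignored."""
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    """Group the emitted values by word, as a framework would do between map and reduce."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return grouped

def reduce_step(word, values):
    """Sum the 1 values for a single unique word."""
    return (word, sum(values))

def word_count(records):
    """Run map over every record, then reduce over every unique word."""
    pairs = [pair for key, line in records for pair in map_step(key, line)]
    return dict(reduce_step(word, values) for word, values in shuffle(pairs).items())

# Hypothetical input record: the key 1234 is from the example, the text is made up.
records = [(1234, "each record maps to each word")]
counts = word_count(records)
```

Because each occurs twice in the sample line, counts contains the pair each: 2, while every other word maps to 1. In a real cluster the records would be split across machines and the grouping step would move data between them, but the logic of the two steps is the same.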
Expanding on the word count example, the final output of a MapReduce process applied to a set of documents might have the key as an ordered pair and the value as an ordered tuple of length 2n. A possible representation of such a key/value pair follows:
<(filename, datetime), (word1, 5, word2, 7, …, wordn, 6)>
In this construction, the key is the ordered pair of filename and datetime. The value consists of the n pairs of words and their individual counts in the corresponding file.
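One way to build such a key/value pair for a single document is sketched below in Python. The function name document_record, the sample filename, the timestamp string, and the sample text are all hypothetical. Sorting the words alphabetically is also an assumption made here so that the tuple has a fixed order; the construction above does not say how the words are ordered.

```python
from collections import Counter

def document_record(filename, timestamp, text):
    """Return ((filename, timestamp), (word1, count1, ..., wordn, countn))."""
    counts = Counter(text.lower().split())
    # Flatten the n (word, count) pairs into one tuple of length 2n.
    # Alphabetical order is an assumption of this sketch.
    value = tuple(item for word, count in sorted(counts.items())
                  for item in (word, count))
    return ((filename, timestamp), value)

# Hypothetical document, used only for illustration.
key, value = document_record("notes.txt", "2024-01-01T00:00:00", "the cat saw the dog")
```

Here the sample text has n = 4 distinct words, so value has length 8, and key is the ordered pair of the filename and the timestamp.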
Of course, a word count problem could be addressed in many ways other than
MapReduce. However, MapReduce has the advantage of being able to distribute
the workload over a cluster of computers and run the tasks in parallel. In a word
count, the documents, or even pieces of the documents, could be processed