manually specified in the driver. Such options are useful depending on how the
MapReduce job output will be consumed in downstream processing.
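As an illustration, the sketch below shows how a driver might set such output options using the standard Hadoop Java API. The class names WordCountMapper and WordCountReducer, and the particular choices shown (a SequenceFile output format and a single reducer), are assumptions made for this example rather than the specific options discussed above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        // Mapper and reducer classes (hypothetical names used throughout this sketch)
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);

        // Key/value types emitted by the job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Example of an output option set in the driver: write the results as a
        // SequenceFile so a downstream MapReduce job can read them directly.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);

        // Example: a single reducer produces one output file
        job.setNumReduceTasks(1);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```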
The mapper provides the logic that is applied to each data block of the input
files specified in the driver code. For example, in the word count MapReduce
example provided earlier, a map task is instantiated on a worker node where a
data block resides. Each map task processes a fragment of the text line by line,
parses each line into words, and emits <word, 1> for each word, regardless of
how many times the word appears in the line of text. The key/value pairs are
stored temporarily in the worker node's memory (or spilled to the node's disk).
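A word count mapper of this kind typically looks like the following sketch. The class name and the whitespace tokenization are illustrative choices, not necessarily those of the earlier example.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Emits <word, 1> for every word in every input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // The framework calls map() once per line; the key is the byte offset
        // of the line within the file, and the value is the line of text.
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            // Emit <word, 1> regardless of how often the word repeats in the line.
            context.write(word, ONE);
        }
    }
}
```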
Next, the key/value pairs are processed by the built-in shuffle and sort
functionality, based on the number of reducers to be executed. In this simple
example, there is only one reducer, so all the intermediate data is passed to it.
From the outputs of the various map tasks, a list of the associated values
(presented to the reducer as an Iterable in Java) is constructed for each unique
key. Hadoop also ensures that the keys are passed to each reducer in sorted
order. In Figure 10.3, <each,(1,1)> is the first key/value pair processed,
followed alphabetically by <For,(1)> and the rest of the key/value pairs until
the last one is passed to the reducer. The ( ) notation denotes a list of values,
which in this case is simply a list of ones.
Figure 10.3 Shuffle and sort
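The following standalone sketch is not Hadoop code; it only mimics, in plain Java, what shuffle and sort accomplish for a single reducer: grouping the intermediate <word, 1> pairs by key and ordering the keys. A case-insensitive comparator is used so the ordering matches the alphabetical order described for Figure 10.3; Hadoop's default Text ordering is actually byte-wise.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual illustration of grouping and sorting intermediate pairs.
public class ShuffleSortSketch {
    public static void main(String[] args) {
        // Intermediate pairs as they might be emitted by several map tasks
        String[][] mapOutputs = {
            {"For", "1"}, {"each", "1"}, {"word", "1"},
            {"each", "1"}, {"word", "1"}, {"word", "1"}
        };

        // TreeMap keeps keys sorted; values are accumulated into lists
        Map<String, List<Integer>> grouped = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        for (String[] pair : mapOutputs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                   .add(Integer.parseInt(pair[1]));
        }

        // Each entry is what the single reducer receives, e.g. <each,[1, 1]>
        grouped.forEach((word, ones) -> System.out.println("<" + word + "," + ones + ">"));
    }
}
```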
In general, each reducer processes the values for each key and emits a key/value
pair as defined by the reduce logic. The output is then stored in HDFS like any
other file in, say, 64 MB blocks replicated three times across the nodes.
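For the word count example, the reduce logic sums the list of ones for each word. A typical reducer is sketched below; the class name matches the hypothetical one referenced in the driver sketch above.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Sums the list of ones for each word and emits <word, total count>.
public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // values is the list gathered by shuffle and sort for this key,
        // e.g. <each,(1,1)> arrives here as key "each" with values [1, 1].
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        // The emitted pairs are written to HDFS by the job's output format.
        context.write(key, result);
    }
}
```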