Map Phase
The Map phase is modeled after the Map operation in functional
programming. In functional programming, Map applies a function of one
argument to each element in a collection and creates a new collection with
all the results. For example, applying a function ToUpperCase() to each
character in a string produces a new string in which every character is
uppercase. In functional programming, functions are “pure,” which means
that they have no external side effects.
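As a minimal sketch of the functional Map described above, the Python snippet below applies a pure function to each character of a string. The name to_upper_case is a stand-in for ToUpperCase() and the sample string is made up for illustration.

```python
def to_upper_case(ch: str) -> str:
    """Pure function: the result depends only on the single argument."""
    return ch.upper()

source = "hello"
# map() applies the function to every element and yields the results;
# joining them builds a new string and leaves the original untouched.
result = "".join(map(to_upper_case, source))
print(result)  # HELLO
```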
This lack of side effects may sound like a technicality, but it is actually a
key factor in MapReduce. When you call Map on a collection, the function
called on each element has no inputs other than the element it is operating
on and does not change any external state, so the order in which elements
are processed doesn't matter. The Map operation might process the collection
in order, in reverse order, or in parallel. The operation might even run in
another process or on another machine.
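The order independence can be checked with a small experiment. The sketch below is illustrative only: square and the sample data are hypothetical, and a thread pool stands in for the separate processes or machines a real system would use.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x: int) -> int:
    """Pure function: no inputs besides x, no external state changed."""
    return x * x

data = [1, 2, 3, 4, 5]

# Process the elements front to back.
in_order = [square(x) for x in data]
# Process them back to front, then put the results back in input order.
reverse_order = [square(x) for x in reversed(data)][::-1]
# Process them on a pool of workers; Executor.map returns results in input order.
with ThreadPoolExecutor(max_workers=3) as pool:
    in_parallel = list(pool.map(square, data))

print(in_order == reverse_order == in_parallel)  # True
```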
In MapReduce, the Map phase is slightly more general. It runs a Mapper
function over each element in the source data. The Mapper function takes
one element and emits a list of 0 or more key-value pairs. The key is an
address used by the Reduce phase, and the value is the value to be processed
in the Reduce phase. For example, if your MapReduce were computing
the number of times each word appears in the source data, the Mapper
function would take a single line, break it up into words, and emit each word
paired with the number of times it appeared in that line. The Reduce phase
is responsible for collecting the results.
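The word-count Mapper described above might look like the following sketch. The name word_count_mapper, the whitespace-based word splitting, and the sample lines are assumptions made for illustration; a real framework would define its own Mapper interface.

```python
from collections import Counter
from typing import Iterator, Tuple

def word_count_mapper(line: str) -> Iterator[Tuple[str, int]]:
    """Take one line and emit 0 or more (word, count-in-this-line) pairs."""
    # Splitting on whitespace is a simplifying assumption about what a word is.
    for word, count in Counter(line.split()).items():
        yield word, count

lines = ["the quick brown fox", "the lazy dog jumps over the fox", ""]
# Run the Mapper over each element of the source data.
mapped = [pair for line in lines for pair in word_count_mapper(line)]
for key, value in mapped:
    print(key, value)
```

The empty line emits no pairs at all, and the second line emits ("the", 2) because the word appears twice within it.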
Combine Phase
After the Map phase has started producing values, the Combine phase can
start, if needed. Combine is merely an optimization that reduces the amount of
I/O that has to be done in the next phases. A Combiner function can take
multiple outputs within the same Map worker and combine them, so less
data needs to be written out to disk before the Shuffle can operate on it.
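For the word-count case, a Combiner could simply sum the counts for each word within one Map worker. The sketch below assumes that behavior; the function name combine and the sample worker output are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple

def combine(pairs: Iterable[Tuple[str, int]]) -> List[Tuple[str, int]]:
    """Sum the values for each key emitted by a single Map worker."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    # Sorting by key is optional here; it just makes the output predictable.
    return sorted(totals.items())

# Pairs emitted by one Map worker across several input lines.
worker_output = [("the", 1), ("fox", 1), ("the", 2), ("dog", 1), ("fox", 1)]
combined = combine(worker_output)
print(combined)  # [('dog', 1), ('fox', 2), ('the', 3)]
```

Five pairs shrink to three, so less data has to be written out before the Shuffle.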
Shuffle Phase
The Shuffle phase, like the shuffle operation in BigQuery, is just a big sort.
In practice, it tends to be a merge sort because that can be done in parallel.
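To illustrate the merge step, the sketch below assumes each Map worker has already written its output sorted by key, then merges those sorted runs and groups the values by key. The worker data is made up, and heapq.merge is used here only as a convenient single-machine stand-in for a distributed merge sort.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# Each run is one worker's output, already sorted by key (an assumption).
worker_a = [("dog", 1), ("fox", 2), ("the", 3)]
worker_b = [("cat", 1), ("fox", 1), ("the", 1)]

# Merge the sorted runs into one stream that is sorted by key.
merged = list(heapq.merge(worker_a, worker_b, key=itemgetter(0)))

# Equal keys are now adjacent, so each key's values can be collected together.
grouped = {
    key: [value for _, value in group]
    for key, group in groupby(merged, key=itemgetter(0))
}
print(grouped)
```

Each sorted run can be produced independently, which is what lets the sort be split across workers.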