Databases Reference
In-Depth Information
Enter MapReduce
In 2004 Jeff and Sanjay published their paper “MapReduce: Simplified
Data Processing on Large Clusters” (and here's another one on the
underlying filesystem ).
MapReduce allows us to stop thinking about fault tolerance; it is a
platform that does the fault tolerance work for us. Programming 1,000
computers is now easier than programming 100. It's a library to do
fancy things.
To use MapReduce, you write two functions: a mapper function, and
then a reducer function. It takes these functions and runs them on
many machines that are local to your stored data. All of the fault tol‐
erance is automatically done for you once you've placed the algorithm
into the map/reduce framework.
The mapper takes each data point and produces an ordered pair of the
form (key, value). The framework then sorts the outputs via the “shuf‐
fle,” and in particular finds all the keys that match and puts them to‐
gether in a pile. Then it sends these piles to machines that process them
using the reducer function. The reducer function's outputs are of the
form (key, new value), where the new value is some aggregate function
of the old values.
So how do we do it for our word counting algorithm? For each word,
just send it to the ordered pair with the key as that word and the value
being the integer 1. So:
red ---> ("red", 1)
blue ---> ("blue", 1)
red ---> ("red", 1)
Then they go into the “shuffle” (via the “fan-in”) and we get a pile of
(“red”, 1)'s, which we can rewrite as (“red”, 1, 1). This gets sent to the
reducer function, which just adds up all the 1's. We end up with (“red”,
2), (“blue”, 1).
The key point is: one reducer handles all the values for a fixed key .
Got more data? Increase the number of map workers and reduce
workers. In other words, do it on more computers. MapReduce flattens
the complexity of working with many computers. It's elegant, and
people use it even when they shouldn't (although, at Google it's not so
crazy to assume your data could grow by a factor of 100 overnight).
Like all tools, it gets overused.
Search WWH ::




Custom Search