Counting was one easy function, but now it's been split up into two functions. In general, converting an algorithm into a series of MapReduce steps is often unintuitive.
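That two-function split can be sketched in plain Python. This is a toy simulation, not a real distributed framework: the shuffle phase that groups mapper output by key is mimicked here with an in-memory dictionary.

```python
from collections import defaultdict

def mapper(document):
    # Emit a (word, 1) pair for every word in one document.
    for word in document.split():
        yield word, 1

def reducer(word, counts):
    # Sum all the 1s that the shuffle grouped under this word.
    return word, sum(counts)

def run(documents):
    # The "shuffle": group mapper output by key before reducing.
    groups = defaultdict(list)
    for doc in documents:
        for word, one in mapper(doc):
            groups[word].append(one)
    return dict(reducer(w, c) for w, c in groups.items())

print(run(["the cat", "the dog"]))  # {'the': 2, 'cat': 1, 'dog': 1}
```

Note that neither function needs to see the whole dataset: the mapper looks at one document at a time, and the reducer looks at one key at a time, which is exactly what lets each run on a separate machine.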
For the preceding word count, the distribution of words needs to be uniform. If all your words are the same, they all go to one machine during the shuffle, which causes huge problems. Google has solved this using hash buckets with heaps in the mappers in one MapReduce iteration. It's called CountSketch, and it is built to handle odd datasets.
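The core CountSketch idea (from the published algorithm by Charikar, Chen, and Farach-Colton; this toy version makes no claim about Google's actual implementation) is to hash each item into a few rows of signed counters, so no single key can overwhelm one machine, and to read counts back as a median across rows:

```python
import random
from statistics import median

class CountSketch:
    """Toy CountSketch: depth rows of width signed counters."""

    def __init__(self, depth=5, width=256, seed=0):
        rng = random.Random(seed)
        self.depth, self.width = depth, width
        # Per-row salts for the bucket hash and the +/-1 sign hash.
        self.salts = [(rng.random(), rng.random()) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        return hash((self.salts[row][0], item)) % self.width

    def _sign(self, row, item):
        return 1 if hash((self.salts[row][1], item)) % 2 else -1

    def add(self, item):
        for r in range(self.depth):
            self.table[r][self._bucket(r, item)] += self._sign(r, item)

    def estimate(self, item):
        # Taking the median over rows cancels collision noise in expectation.
        return median(self._sign(r, item) * self.table[r][self._bucket(r, item)]
                      for r in range(self.depth))

cs = CountSketch()
for _ in range(1000):
    cs.add("the")
cs.add("rare")
# cs.estimate("the") is now very close to 1000 despite the fixed memory.
```

The memory used is fixed (depth × width counters) no matter how skewed the input is, which is what makes the heavy-key problem tractable.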
At Google there's a real-time monitor for MapReduce jobs, a box with
shards that correspond to pieces of work on a machine. It indicates
through a bar chart how the various machines are doing. If all the
mappers are running well, you'd see a straight line across. Usually,
however, everything goes wrong in the reduce step due to nonuniformity of the data; e.g., lots of values on one key.
The data preparation and the writing of output, which take place behind the scenes, take a long time, so it's good to try to do everything in one iteration. Note we're assuming a distributed filesystem is already there; indeed, we have to use MapReduce to get data onto the distributed filesystem in the first place, so once we start using MapReduce, we can't stop.
Once you get into the optimization process, you find yourself tuning MapReduce jobs to shave off nanoseconds (10⁻⁹ seconds) while processing petabytes of data. These are order shifts worthy of physicists. This optimization is almost all done in C++. It's highly optimized code, and we try to squeeze out every ounce of performance we can.
Other Examples of MapReduce
Counting words is the most basic example of MapReduce. Let's look
at another to start getting more of a feel for it. The key attribute of a
problem that can be solved with MapReduce is that the data can be
distributed among many computers and the algorithm can treat each
of those computers separately, i.e., one computer doesn't need to know
what's going on with any other computer.
Here's another example where you could use MapReduce. Let's say you
had tons of timestamped event data and logs of users' actions on a
website. For each user, you might have {user_id, IP_address, zip_code, ad_they_saw, did_they_click}. Suppose you wanted to
count how many unique users saw ads from each zip code and how
many clicked at least once.
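A possible two-pass solution can be sketched in plain Python (the records below are made-up illustrations shaped like the tuple above; a real job would run each pass as its own MapReduce iteration):

```python
from collections import defaultdict

# Hypothetical event records shaped like the tuple in the text.
events = [
    {"user_id": 1, "zip_code": "10001", "ad_they_saw": "a1", "did_they_click": True},
    {"user_id": 1, "zip_code": "10001", "ad_they_saw": "a2", "did_they_click": False},
    {"user_id": 2, "zip_code": "10001", "ad_they_saw": "a1", "did_they_click": False},
    {"user_id": 3, "zip_code": "94107", "ad_they_saw": "a1", "did_they_click": True},
]

# Pass 1 -- map: key on (zip_code, user_id);
#           reduce: did this user ever click, over all their events?
per_user = defaultdict(bool)
for e in events:
    key = (e["zip_code"], e["user_id"])
    per_user[key] = per_user[key] or e["did_they_click"]

# Pass 2 -- map: key on zip_code alone;
#           reduce: count the deduplicated users and the clickers.
stats = defaultdict(lambda: [0, 0])  # zip_code -> [unique viewers, clickers]
for (zip_code, _user), clicked in per_user.items():
    stats[zip_code][0] += 1
    stats[zip_code][1] += clicked

print(dict(stats))  # {'10001': [2, 1], '94107': [1, 1]}
```

The first pass is needed because "unique users" and "clicked at least once" are per-user facts, and a single reduce keyed only on zip_code would count the same user once per event.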