Counting was one easy function, but now it's been split up into two functions. In general, converting an algorithm into a series of MapReduce steps is often unintuitive.
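That two-function split can be sketched in plain Python. This is a toy simulation, not a real distributed framework: the shuffle phase that groups mapper output by key is mimicked here with an in-memory dictionary.

```python
from collections import defaultdict

def mapper(document):
    # Emit a (word, 1) pair for every word in one document.
    for word in document.split():
        yield word, 1

def reducer(word, counts):
    # Sum all the 1s that the shuffle grouped under this word.
    return word, sum(counts)

def run(documents):
    # The "shuffle": group mapper output by key before reducing.
    groups = defaultdict(list)
    for doc in documents:
        for word, one in mapper(doc):
            groups[word].append(one)
    return dict(reducer(w, c) for w, c in groups.items())

print(run(["the cat", "the dog"]))  # {'the': 2, 'cat': 1, 'dog': 1}
```

Note that neither function needs to see the whole dataset: the mapper looks at one document at a time, and the reducer looks at one key at a time, which is exactly what lets each run on a separate machine.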
For the preceding word count, the distribution of words needs to be uniform. If all your words are the same, they all go to one machine during the shuffle, which causes huge problems. Google has solved this using hash buckets with heaps in the mappers in one MapReduce iteration. It's called CountSketch, and it is built to handle odd datasets.
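The core CountSketch idea (from the published algorithm by Charikar, Chen, and Farach-Colton; this toy version makes no claim about Google's actual implementation) is to hash each item into a few rows of signed counters, so no single key can overwhelm one machine, and to read counts back as a median across rows:

```python
import random
from statistics import median

class CountSketch:
    """Toy CountSketch: depth rows of width signed counters."""

    def __init__(self, depth=5, width=256, seed=0):
        rng = random.Random(seed)
        self.depth, self.width = depth, width
        # Per-row salts for the bucket hash and the +/-1 sign hash.
        self.salts = [(rng.random(), rng.random()) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, item):
        return hash((self.salts[row][0], item)) % self.width

    def _sign(self, row, item):
        return 1 if hash((self.salts[row][1], item)) % 2 else -1

    def add(self, item):
        for r in range(self.depth):
            self.table[r][self._bucket(r, item)] += self._sign(r, item)

    def estimate(self, item):
        # Taking the median over rows cancels collision noise in expectation.
        return median(self._sign(r, item) * self.table[r][self._bucket(r, item)]
                      for r in range(self.depth))

cs = CountSketch()
for _ in range(1000):
    cs.add("the")
cs.add("rare")
# cs.estimate("the") is now very close to 1000 despite the fixed memory.
```

The memory used is fixed (depth × width counters) no matter how skewed the input is, which is what makes the heavy-key problem tractable.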
At Google there's a real-time monitor for MapReduce jobs, a box with
shards that correspond to pieces of work on a machine. It indicates
through a bar chart how the various machines are doing. If all the
mappers are running well, you'd see a straight line across. Usually,
however, everything goes wrong in the reduce step due to nonuniformity of the data; e.g., lots of values on one key.
The data preparation and the writing of output, which take place behind the scenes, take a long time, so it's good to try to do everything in one iteration. Note we're assuming a distributed filesystem is already there; indeed, we have to use MapReduce to get data onto the distributed filesystem in the first place, so once we start using MapReduce, we can't stop.
Once you get into the optimization process, you find yourself tuning MapReduce jobs to shave off nanoseconds (10⁻⁹ seconds) while processing petabytes of data. These are order shifts worthy of physicists. This optimization is almost all done in C++. It's highly optimized code, and we try to squeeze out every ounce of performance we can.
Other Examples of MapReduce
Counting words is the most basic example of MapReduce. Let's look
at another to start getting more of a feel for it. The key attribute of a
problem that can be solved with MapReduce is that the data can be
distributed among many computers and the algorithm can treat each
of those computers separately, i.e., one computer doesn't need to know
what's going on with any other computer.
Here's another example where you could use MapReduce. Let's say you
had tons of timestamped event data and logs of users' actions on a
website. For each user, you might have {user_id, IP_address, zip_code, ad_they_saw, did_they_click}. Suppose you wanted to
count how many unique users saw ads from each zip code and how
many clicked at least once.
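A possible two-pass solution can be sketched in plain Python (the records below are made-up illustrations shaped like the tuple above; a real job would run each pass as its own MapReduce iteration):

```python
from collections import defaultdict

# Hypothetical event records shaped like the tuple in the text.
events = [
    {"user_id": 1, "zip_code": "10001", "ad_they_saw": "a1", "did_they_click": True},
    {"user_id": 1, "zip_code": "10001", "ad_they_saw": "a2", "did_they_click": False},
    {"user_id": 2, "zip_code": "10001", "ad_they_saw": "a1", "did_they_click": False},
    {"user_id": 3, "zip_code": "94107", "ad_they_saw": "a1", "did_they_click": True},
]

# Pass 1 -- map: key on (zip_code, user_id);
#           reduce: did this user ever click, over all their events?
per_user = defaultdict(bool)
for e in events:
    key = (e["zip_code"], e["user_id"])
    per_user[key] = per_user[key] or e["did_they_click"]

# Pass 2 -- map: key on zip_code alone;
#           reduce: count the deduplicated users and the clickers.
stats = defaultdict(lambda: [0, 0])  # zip_code -> [unique viewers, clickers]
for (zip_code, _user), clicked in per_user.items():
    stats[zip_code][0] += 1
    stats[zip_code][1] += clicked

print(dict(stats))  # {'10001': [2, 1], '94107': [1, 1]}
```

The first pass is needed because "unique users" and "clicked at least once" are per-user facts, and a single reduce keyed only on zip_code would count the same user once per event.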