too big for memory. On the other hand, this will probably work nearly
all the time, because in practice there is plenty of repetition. Real
programming is a messy game.
Hold up, computers nowadays are many-core machines; let's use them
all! Then bandwidth becomes the problem, so let's compress the
inputs, too. That helps, and there are better, if more complex,
alternatives. A heap of hashed values has a bounded size and can be
well behaved. A heap is something like a partially ordered set, and you
can throw away super small elements to avoid holding everything in
memory. This won't always work, but it will in most cases.
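To make the heap idea concrete, here is a minimal Python sketch (our own illustration, not a standard algorithm): process the stream in chunks, push (count, word) pairs onto a min-heap, and throw away the smallest entry whenever the heap grows past k. The caveat in the comments is exactly the "won't always work" part.

    import heapq
    from collections import Counter

    def top_k_words(chunks, k=10):
        # Approximate top-k word counts over an iterable of word lists.
        # The min-heap is bounded at k entries, so memory stays small;
        # a word whose count is split across many chunks can be
        # undercounted -- this is the "won't always work" part.
        heap = []  # min-heap of (count, word); smallest count on top
        for chunk in chunks:
            for word, count in Counter(chunk).items():
                heapq.heappush(heap, (count, word))
                if len(heap) > k:
                    heapq.heappop(heap)  # throw away a super small element
        return sorted(heap, reverse=True)

    chunks = [["a", "b", "a", "c"], ["a", "c", "c", "d"]]
    print(top_k_words(chunks, k=2))  # [(2, 'c'), (2, 'a')]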
Are You Keeping Up?
You don't need to understand all these details, but we want
you to get a flavor of why MapReduce is even necessary.
Now we can deal with on the order of 10 trillion words using one
computer.
Now say we have 10 computers. This will get us 100 trillion words.
Each computer has 1/10th of the input. Let's get each computer to
count up its share of the words. Then have each send its counts to one
"controller" machine. The controller adds them up and finds the highest
to solve the problem.
We can do this with hashed heaps, too, if we first learn network
programming.
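Short of real network programming, here is a toy single-process simulation of that setup in Python, with a small list of shards standing in for the 10 machines' inputs (the data is made up for illustration):

    from collections import Counter

    def count_shard(shard):
        # What each of the 10 machines does: count its share of the words.
        return Counter(shard)

    def controller(shard_counts):
        # The controller adds the per-machine counts and finds the highest.
        total = Counter()
        for counts in shard_counts:
            total.update(counts)
        return total.most_common(1)[0]

    # Three tiny shards stand in for the 10 machines' inputs.
    shards = [["to", "be"], ["or", "not"], ["to", "be"]]
    print(controller(count_shard(s) for s in shards))  # ('to', 2)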
Now take a hundred computers. We can process a thousand trillion
words. But then the "fan-in," where the results are sent to the controller,
will break everything because of bandwidth problems. We need a tree,
where every group of 10 machines sends data to one local controller,
and the local controllers all send their results up to one super-controller.
This will probably work.
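Here is a sketch of that tree-shaped merge, again simulated in one process; fan_in=10 mirrors the groups of 10 above:

    from collections import Counter

    def merge(group):
        # One local controller: add up the counts from its group of machines.
        total = Counter()
        for counts in group:
            total.update(counts)
        return total

    def tree_reduce(machine_counts, fan_in=10):
        # Merge level by level: each group of `fan_in` machines reports to
        # a local controller, controllers report upward, until one is left.
        level = list(machine_counts)
        while len(level) > 1:
            level = [merge(level[i:i + fan_in])
                     for i in range(0, len(level), fan_in)]
        return level[0]

    counts = [Counter({"data": i}) for i in range(100)]  # 100 machines
    print(tree_reduce(counts)["data"])  # 4950, same as a flat sum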
But… can we do this with 1,000 machines? No. It won't work, because
at that scale, one or more computers will fail. If we denote by X the
random variable that indicates whether a given computer is working, so X = 0
means it works and X = 1 means it's broken, then we can assume:

P(X = 0) = 1 − ϵ
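As a quick sanity check (assuming failures are independent and ϵ is small), the chance that all n machines are up is (1 − ϵ)^n, which shrinks fast:

    # Chance that all n machines are working, assuming independent
    # failures; eps = 0.001 is just an illustrative value.
    eps, n = 0.001, 1000
    print((1 - eps) ** n)  # ~0.37, so something is usually broken at this scale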