too big for memory. On the other hand, this will probably work nearly
all the time, because in practice there is plenty of repetition. Real
programming is a messy game.
Hold up, computers nowadays are many-core machines; let's use them
all! Then bandwidth becomes the problem, so let's compress the
inputs, too. That helps, and there are better, if more complex,
alternatives. A heap of hashed values has a bounded size and can be
well behaved. A heap is something like a partially ordered set, and you
can throw away super small elements to avoid holding everything in
memory. This won't always work, but it will in most cases.
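To make the heap idea concrete, here is a minimal Python sketch (our own illustration, not a standard algorithm): process the stream in chunks, push (count, word) pairs onto a min-heap, and throw away the smallest entry whenever the heap grows past k. The caveat in the comments is exactly the "won't always work" part.

    import heapq
    from collections import Counter

    def top_k_words(chunks, k=10):
        # Approximate top-k word counts over an iterable of word lists.
        # The min-heap is bounded at k entries, so memory stays small;
        # a word whose count is split across many chunks can be
        # undercounted -- this is the "won't always work" part.
        heap = []  # min-heap of (count, word); smallest count on top
        for chunk in chunks:
            for word, count in Counter(chunk).items():
                heapq.heappush(heap, (count, word))
                if len(heap) > k:
                    heapq.heappop(heap)  # throw away a super small element
        return sorted(heap, reverse=True)

    chunks = [["a", "b", "a", "c"], ["a", "c", "c", "d"]]
    print(top_k_words(chunks, k=2))  # [(2, 'c'), (2, 'a')]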
Are You Keeping Up?
You don't need to understand all these details, but we want
you to get a flavor of why MapReduce is even necessary.
Now we can deal with on the order of 10 trillion words using one
computer.
Now say we have 10 computers. This will get us 100 trillion words.
Each computer has 1/10th of the input. Let's get each computer to
count up its share of the words. Then have each send its counts to one
"controller" machine. The controller adds them up and finds the highest
to solve the problem.
We can do this with hashed heaps, too, if we first learn network
programming.
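Short of real network programming, here is a toy single-process simulation of that setup in Python, with a small list of shards standing in for the 10 machines' inputs (the data is made up for illustration):

    from collections import Counter

    def count_shard(shard):
        # What each of the 10 machines does: count its share of the words.
        return Counter(shard)

    def controller(shard_counts):
        # The controller adds the per-machine counts and finds the highest.
        total = Counter()
        for counts in shard_counts:
            total.update(counts)
        return total.most_common(1)[0]

    # Three tiny shards stand in for the 10 machines' inputs.
    shards = [["to", "be"], ["or", "not"], ["to", "be"]]
    print(controller(count_shard(s) for s in shards))  # ('to', 2)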
Now take a hundred computers. We can process a thousand trillion
words. But then the "fan-in," where the results are sent to the controller,
will break everything because of bandwidth problems. We need a tree,
where every group of 10 machines sends data to one local controller,
and the local controllers all send their results up to one super-controller.
This will probably work.
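Here is a sketch of that tree-shaped merge, again simulated in one process; fan_in=10 mirrors the groups of 10 above:

    from collections import Counter

    def merge(group):
        # One local controller: add up the counts from its group of machines.
        total = Counter()
        for counts in group:
            total.update(counts)
        return total

    def tree_reduce(machine_counts, fan_in=10):
        # Merge level by level: each group of `fan_in` machines reports to
        # a local controller, controllers report upward, until one is left.
        level = list(machine_counts)
        while len(level) > 1:
            level = [merge(level[i:i + fan_in])
                     for i in range(0, len(level), fan_in)]
        return level[0]

    counts = [Counter({"data": i}) for i in range(100)]  # 100 machines
    print(tree_reduce(counts)["data"])  # 4950, same as a flat sum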
But… can we do this with 1,000 machines? No. It won't work, because
at that scale, one or more computers will fail. If we denote by X the
random variable that indicates whether a given computer is working, so X = 0
means it works and X = 1 means it's broken, then we can assume:

P(X = 0) = 1 − ϵ
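As a quick sanity check (assuming failures are independent and ϵ is small), the chance that all n machines are up is (1 − ϵ)^n, which shrinks fast:

    # Chance that all n machines are working, assuming independent
    # failures; eps = 0.001 is just an illustrative value.
    eps, n = 0.001, 1000
    print((1 - eps) ** n)  # ~0.37, so something is usually broken at this scale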