2.6.5
When Not All Inputs Are Present
Example 2.15 describes a problem where we know every possible input is present, because
we can define the input set to be those pictures that actually exist in the dataset. However,
as discussed at the end of Section 2.6.3, there are problems like computing the join, where
the graph of inputs and outputs describes inputs that might exist, and outputs that are
made only when certain of those inputs actually appear in the dataset. In fact, for the join,
both inputs related to an output must exist if we are to make that output.
An algorithm for a problem where outputs can be missing still needs a mapping schema.
The justification is that any subset of the inputs might be present, so if no reducer covered
some output, the algorithm would fail to produce that output whenever all the inputs
related to it happened to be present.
The only way the absence of some inputs makes a difference is that we may wish to
rethink the desired value of the reducer size q when we select an algorithm from the family
of possible algorithms. In particular, if we choose q to be the largest number of inputs that
we can be sure will fit in main memory, then we may wish to increase q to take
into account that some fraction of the inputs are not really there.
EXAMPLE 2.16 Suppose that we know we can execute the Reduce function in main memory
on a key and its associated list of q values. However, we also know that only 5% of the
possible inputs are really present in the dataset. Then a mapping schema for reducer size
q will really send only about q/20 inputs that exist to each reducer. Put another way, we
could use the algorithm for reducer size 20q and expect that an average of q inputs will
actually appear on the list for each reducer. We can thus choose 20q as the reducer size,
or, since there will be some randomness in the number of inputs actually appearing at each
reducer, we might wish to pick a slightly smaller reducer size, such as 18q.
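To make the arithmetic concrete, here is a small sketch in Python. It is not from the text: the function names, the 90% safety factor, and the assumption that each possible input is present independently with probability 5% are all illustrative choices. It picks a nominal reducer size from the memory limit q and the presence fraction, then simulates how many real inputs a reducer would actually receive.

```python
import random

def nominal_reducer_size(q_memory, presence_fraction, safety=0.9):
    """Choose a nominal reducer size so that, when only a fraction of the
    possible inputs are actually present, the expected number of real inputs
    per reducer stays safely below the memory limit q_memory.

    safety < 1 shrinks the result to guard against random fluctuation,
    mirroring the text's choice of 18q rather than 20q for a 5% presence rate.
    """
    return int(safety * q_memory / presence_fraction)

def simulate_inputs_at_reducer(nominal_q, presence_fraction, trials=200):
    """Estimate how many of a reducer's nominal_q possible inputs actually
    show up, assuming each input is present independently with the given
    probability (an idealizing assumption, not part of the original text)."""
    counts = [sum(random.random() < presence_fraction for _ in range(nominal_q))
              for _ in range(trials)]
    return sum(counts) / trials, max(counts)

if __name__ == "__main__":
    q = 1000   # inputs that fit in main memory (hypothetical value)
    p = 0.05   # 5% of the possible inputs are actually present
    nq = nominal_reducer_size(q, p)        # 18,000 here, i.e. 18q as in the example
    avg, worst = simulate_inputs_at_reducer(nq, p)
    print(f"nominal reducer size: {nq}")
    print(f"average present inputs per reducer: {avg:.1f} (memory limit {q})")
    print(f"worst case over trials: {worst}")
```

With q = 1000 and a 5% presence rate, the nominal reducer size comes out to 18,000, and the simulated number of present inputs per reducer averages about 900, comfortably within the memory limit.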
2.6.6
Lower Bounds on Replication Rate
The family of similarity-join algorithms described in Example 2.15 lets us trade off com-
munication against reducer size, and through reducer size, against parallelism or the
ability to execute the Reduce function in main memory.
How do we know we are getting the best possible tradeoff? We can only know we have the
minimum possible communication if we can prove a matching lower bound. Using the
existence of a mapping schema as the starting point, we can often prove such a lower bound.
Here is an outline of the technique.
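As a rough sketch of the shape such an argument usually takes (the notation here is supplied for illustration: k reducers, the ith receiving q_i ≤ q inputs; |I| and |O| the numbers of possible inputs and outputs; and g(q) an upper bound on the number of outputs a reducer receiving q inputs can cover, with g(q)/q assumed nondecreasing):

```latex
\sum_{i=1}^{k} g(q_i) \;\ge\; |O|,
\qquad
g(q_i) \;\le\; \frac{q_i}{q}\,g(q)
\quad\Longrightarrow\quad
r \;=\; \frac{\sum_{i=1}^{k} q_i}{|I|} \;\ge\; \frac{q\,|O|}{g(q)\,|I|}.
```

That is, the reducers together must cover every output, and bounding each reducer's coverage in terms of its size yields a lower bound on the replication rate.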