2.6.5
When Not All Inputs Are Present
Example 2.15 describes a problem where we know every possible input is present, because
we can define the input set to be those pictures that actually exist in the dataset. However,
as discussed at the end of Section 2.6.3, there are problems like computing the join, where
the graph of inputs and outputs describes inputs that might exist, and outputs that are
made only when certain of those inputs actually appear in the dataset. In fact, for the join,
both inputs related to an output must exist if we are to make that output.
An algorithm for a problem where outputs can be missing still needs a mapping schema.
The justification is that any subset of the inputs might be present, so if no reducer covered
some output, the algorithm would fail to produce that output whenever all the inputs
related to it happened to be present.
The only way the absence of some inputs makes a difference is that we may wish to
rethink the desired value of the reducer size q when we select an algorithm from the family
of possible algorithms. In particular, if we choose q to be the largest number of inputs that
we can be sure will fit in main memory, then we may wish to increase q to take
into account that some fraction of the inputs are not really there.
EXAMPLE 2.16 Suppose that we know we can execute the Reduce function in main memory
on a key and its associated list of q values. However, we also know that only 5% of the
possible inputs are really present in the dataset. Then a mapping schema for reducer size
q will really send only about q/20 inputs that exist to each reducer. Put another way, we
could use the algorithm for reducer size 20q and expect that an average of q inputs will
actually appear on the list for each reducer. We can thus choose 20q as the reducer size,
or, since there will be some randomness in the number of inputs actually appearing at each
reducer, we might wish to pick a slightly smaller reducer size, such as 18q.
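To make the arithmetic concrete, here is a small sketch in Python. It is not from the text: the function names, the 90% safety factor, and the assumption that each possible input is present independently with probability 5% are all illustrative choices. It picks a nominal reducer size from the memory limit q and the presence fraction, then simulates how many real inputs a reducer would actually receive.

```python
import random

def nominal_reducer_size(q_memory, presence_fraction, safety=0.9):
    """Choose a nominal reducer size so that, when only a fraction of the
    possible inputs are actually present, the expected number of real inputs
    per reducer stays safely below the memory limit q_memory.

    safety < 1 shrinks the result to guard against random fluctuation,
    mirroring the text's choice of 18q rather than 20q for a 5% presence rate.
    """
    return int(safety * q_memory / presence_fraction)

def simulate_inputs_at_reducer(nominal_q, presence_fraction, trials=200):
    """Estimate how many of a reducer's nominal_q possible inputs actually
    show up, assuming each input is present independently with the given
    probability (an idealizing assumption, not part of the original text)."""
    counts = [sum(random.random() < presence_fraction for _ in range(nominal_q))
              for _ in range(trials)]
    return sum(counts) / trials, max(counts)

if __name__ == "__main__":
    q = 1000   # inputs that fit in main memory (hypothetical value)
    p = 0.05   # 5% of the possible inputs are actually present
    nq = nominal_reducer_size(q, p)        # 18,000 here, i.e. 18q as in the example
    avg, worst = simulate_inputs_at_reducer(nq, p)
    print(f"nominal reducer size: {nq}")
    print(f"average present inputs per reducer: {avg:.1f} (memory limit {q})")
    print(f"worst case over trials: {worst}")
```

With q = 1000 and a 5% presence rate, the nominal reducer size comes out to 18,000, and the simulated number of present inputs per reducer averages about 900, comfortably within the memory limit.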
2.6.6
Lower Bounds on Replication Rate
The family of similarity-join algorithms described in Example 2.15 lets us trade off com-
munication against reducer size, and through reducer size, against parallelism or the
ability to execute the Reduce function in main memory.
How do we know we are getting the best possible tradeoff? We can only know we have the
minimum possible communication if we can prove a matching lower bound. Using the
existence of a mapping schema as the starting point, we can often prove such a lower bound.
Here is an outline of the technique.
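As a rough sketch of the shape such an argument usually takes (the notation here is supplied for illustration: k reducers, the ith receiving q_i ≤ q inputs; |I| and |O| the numbers of possible inputs and outputs; and g(q) an upper bound on the number of outputs a reducer receiving q inputs can cover, with g(q)/q assumed nondecreasing):

```latex
\sum_{i=1}^{k} g(q_i) \;\ge\; |O|,
\qquad
g(q_i) \;\le\; \frac{q_i}{q}\,g(q)
\quad\Longrightarrow\quad
r \;=\; \frac{\sum_{i=1}^{k} q_i}{|I|} \;\ge\; \frac{q\,|O|}{g(q)\,|I|}.
```

That is, the reducers together must cover every output, and bounding each reducer's coverage in terms of its size yields a lower bound on the replication rate.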