Database Reference
In-Depth Information
improved repartition join strategy fixes the buffering problem by introducing the
following key changes:
-
In the map function, the output key is changed to a composite of the join key
and the table tag. The table tags are generated in a way that ensures records
from one input relation will be sorted ahead of those from the other input
relation on a given join key.
-
The partitioning function is customized so that the hashcode is computed from
just the join key part of the composite key. This way records with the same
join key are still assigned to the same reduce task.
-
As records from the smaller input are guaranteed to be ahead of those from L
for a given join key, only the records from the smaller input are buffered and
the records of the larger input are streamed to generate the join output.
￿
Broadcast join : Instead of moving both input relations across the network as in
the repartition-based joins, the broadcast join approach moves only the smaller
input relation so that it avoids the preprocessing sorting requirement of both
input relations and more importantly avoids the network overhead for moving
the larger relation.
￿
Semi-join : This join approach tries to avoid the problem of the broadcast join
approach where it is possible to send many records of the smaller input relation
across the network while they may not be actually referenced by any records in
the other relation. It achieves this goal at the cost of an extra scan of the smaller
input relation where it determines the set of unique join keys in the smaller
relation, send them to the other relation to specify the list of the actual referenced
join keys and then send only these records across the network for executing the
real execution of the join operation.
￿
Per-split semi-join : This join approach tries to improve the semi-join approach
with a further step to address the fact that not every record in the filtered version
of the smaller relation will join with a particular split of the larger relation.
Therefore, an extra process step is executed to determine the target split(s) of
each filtered join key.
Figure 9.3 illustrates a decision tree that summarizes the tradeoffs of the studied
join strategies according to the results of that study. Based on statistics, such as
the relative data size and the fraction of the join key referenced, this decision tree
tries to determine what is the right join strategy for a given circumstance. If data is
not preprocessed, the right join strategy depends on the size of the data transferred
via the network. If the network cost of broadcasting an input relation R to every
node is less expensive than transferring both R and projected L, then the broadcast
join algorithm should be used. When preprocessing is allowed, semi-join, per-split
semi-join and directed join with sufficient partitions are the best choices. Semi-join
and per-split semi-join offer further flexibility since their preprocessing steps are
insensitive to how the log table is organized, and thus suitable for any number of
reference tables. In addition, the preprocessing steps of these two algorithms are
cheaper since there is no shuffling of the log data.
Search WWH ::




Custom Search