Database Reference
In-Depth Information
the list of the actual referenced join keys and then send only these records
across the network for executing the real execution of the join operation.
•
Per-split semi-join
:
This join approach tries to improve the semi-join
approach with a further step to address the fact that not every record in the
filtered version of the smaller relation will join with a particular split of
the larger relation. Therefore, an extra process step is executed to determine
the target split(s) of each filtered join key.
Figure 2.3 illustrates a decision tree that summarizes the tradeoffs of the studied
join strategies according to the results of that study. Based on statistics, such as the
relative data size and the fraction of the join key referenced, this decision tree tries to
determine what is the right join strategy for a given circumstance. If data is not pre-
processed, the right join strategy depends on the size of the data transferred via the
network. If the network cost of broadcasting an input relation
R to every node is less
expensive than transferring both
R
and projected
L,
then the broadcast join algorithm
should be used. When preprocessing is allowed, semi-join, per-split semi-join, and
directed join with sufficient partitions are the best choices. Semi-join and per-split
semi-join offer further flexibility since their preprocessing steps are insensitive to
how the log table is organized and thus suitable for any number of reference tables.
In addition, the preprocessing steps of these two algorithms are cheaper since there
is no shuffling of the log data.
To tackle the limitation of the extra processing requirements for performing join
operations in the MapReduce framework, the
map-reduce-merge
model [36] have
been introduced to enable the processing of multiple data sets. Figure 2.4 illustrates
the framework of this model where the map phase transforms an input key/value
Preprocessing
No
Ye s
Cheaper to replicate
reference table than
repartition log table?
Preprocessing maintainability
No
Ye s
High
Low
Improved
Broadcast
Semi-join or
per-split semi-join
Prepartitioning with
a large number
of partitions
FIGURE 2.3
Decision tree for choosing between various join strategies on the MapReduce
framework. (From S. Blanas et al., A comparison of join algorithms for log processing in
mapreduce, in
SIGMOD
, pp. 975-986, 2010.)
Search WWH ::
Custom Search