Big Data Processing Systems - Cloud Data Management

Database Reference

In-Depth Information

improved repartition join strategy fixes the buffering problem by introducing the

following key changes:

-

In the map function, the output key is changed to a composite of the join key

and the table tag. The table tags are generated in a way that ensures records

from one input relation will be sorted ahead of those from the other input

relation on a given join key.

-

The partitioning function is customized so that the hashcode is computed from

just the join key part of the composite key. This way records with the same

join key are still assigned to the same reduce task.

-

As records from the smaller input are guaranteed to be ahead of those from L

for a given join key, only the records from the smaller input are buffered and

the records of the larger input are streamed to generate the join output.

Broadcast join : Instead of moving both input relations across the network as in

the repartition-based joins, the broadcast join approach moves only the smaller

input relation so that it avoids the preprocessing sorting requirement of both

input relations and more importantly avoids the network overhead for moving

the larger relation.

Semi-join : This join approach tries to avoid the problem of the broadcast join

approach where it is possible to send many records of the smaller input relation

across the network while they may not be actually referenced by any records in

the other relation. It achieves this goal at the cost of an extra scan of the smaller

input relation where it determines the set of unique join keys in the smaller

relation, send them to the other relation to specify the list of the actual referenced

join keys and then send only these records across the network for executing the

real execution of the join operation.

Per-split semi-join : This join approach tries to improve the semi-join approach

with a further step to address the fact that not every record in the filtered version

of the smaller relation will join with a particular split of the larger relation.

Therefore, an extra process step is executed to determine the target split(s) of

each filtered join key.

Figure 9.3 illustrates a decision tree that summarizes the tradeoffs of the studied

join strategies according to the results of that study. Based on statistics, such as

the relative data size and the fraction of the join key referenced, this decision tree

tries to determine what is the right join strategy for a given circumstance. If data is

not preprocessed, the right join strategy depends on the size of the data transferred

via the network. If the network cost of broadcasting an input relation R to every

node is less expensive than transferring both R and projected L, then the broadcast

join algorithm should be used. When preprocessing is allowed, semi-join, per-split

semi-join and directed join with sufficient partitions are the best choices. Semi-join

and per-split semi-join offer further flexibility since their preprocessing steps are

insensitive to how the log table is organized, and thus suitable for any number of

reference tables. In addition, the preprocessing steps of these two algorithms are

cheaper since there is no shuffling of the log data.

Cloud Data Management

Search WWH ::

Custom Search

Home