Global Positioning System Reference
In-Depth Information
that adds a join stage before the reduce stage. In this approach, mappers
read from input relations, the output of mappers is distributed to joiners
where the actual join task takes place, and the output of joiners is processed
by the reducers.
Most of the previous work on MapReduce-based Joins considers
the case of equi-joins. The two main types of MapReduce-based joins
are Map-side joins and Reduce-side joins. Among the Map-side joins we
have Map-Merge and Broadcast Join. The Map-Merge approach has two
steps: in the fi rst one, input relations are partitioned and sorted, and in
the second one, mappers merge the intermediate results (White 2010). The
Broadcast Join approach considers the case where one of the relations is
small enough to be sent to all mappers and maintained in memory (Blanas
et al. 2010; Chen 2010). The overall execution time is reduced by avoiding
sorting and distributing on both input relations. Repartition join is the most
representative instance of Reduce-side joins (White 2010). In this approach,
the mappers augment each record with a label that identifi es its relation. All
the records with the same join attribute value are sent to the same reducer.
Reducers in turn produce the join pairs.
Recently, MRThetaJoin, a MapReduce-based approach was proposed
to implement Theta-joins (Okcan and Riedewald 2011). This previous
work proposed a randomized algorithm that requires some basic statistics
(input cardinality). The approach proposes a model that partitions the
input relations using a matrix that considers all the combinations of
records that would be required to answer a cross product. The matrix
cells are then assigned to reducers in a way that minimizes job completion
time. A memory-aware variant is also proposed for the common scenario
where partitions do not fi t in memory. This previous work represents the
state-of-the-art approach to answer arbitrary joins in MapReduce. In this
chapter, we compare MRSimJoin with an adaptation of the memory-aware
algorithm to answer Similarity Joins and show that MRSimJoin performs
signifi cantly better.
The work by Vernica et al. (2010), studied the problem of Similarity Joins
in the context of cloud systems. This work, however, focuses on the study of
a different and more specialized type of Similarity Join (Set-Similarity Join)
which constrains its applicability to set-based data. The differences between
this work and ours are: (1) we consider the case of the most extensively
used type of Similarity Join (distance range join), and (2) our approach can
be used with data that lies in any metric space.
A comparison of the MapReduce framework and parallel databases was
presented by Pavlo et al. 2009. Multiple parallel join algorithms have been
proposed in the context of parallel databases (Kitsuregawa and Ogawa 1990;
Schneider and DeWitt 1989). The work by Kitesuregawa and Ogawa 1990
presents a comparison of several hash-based and sort-merge-based parallel
Search WWH ::




Custom Search