functions is that all the intermediate records in a reduce node are sorted
by the intermediate key and a group is formed for each different value of
the intermediate key. Custom partitioner and comparator functions can be
provided to replace the default functions.
The input and output data are usually stored in a distributed file
system (DFS). The MapReduce framework takes care of scheduling tasks,
monitoring them, and re-executing them in case of failures.
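As a concrete illustration of the partitioning hook mentioned above, the following minimal sketch shows a custom Hadoop partitioner that routes intermediate records to reducers using only part of a composite key; the key layout and class name are hypothetical examples, not part of MRSimJoin.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Minimal sketch of a custom partitioner: intermediate records are routed
    // to reducers using only the first component of a composite key of the
    // form "partitionId|recordId". This key layout is a hypothetical example,
    // not the actual MRSimJoin key format.
    public class PrefixPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String prefix = key.toString().split("\\|", 2)[0];
            return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }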
The MRSimJoin Algorithm
The Similarity Join (SJ) operation between two datasets R and S is defined
as follows: R ⋈θε S = {〈r, s〉 | θε(r, s), r ∈ R, s ∈ S}, where θε(r, s) represents
the Similarity Join predicate, i.e., dist(r, s) ≤ ε. A sample SJ query would be
to get the pairs of restaurants (R) and movie theatres (S) that are close to
each other (within a certain distance ε).
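As a minimal sketch of the join predicate θε, assuming two-dimensional points and Euclidean distance as dist (only one possible choice of distance function), the check for a single candidate pair could look as follows; the class and method names are illustrative and not part of MRSimJoin.

    // Sketch of the Similarity Join predicate theta_epsilon(r, s):
    // a pair (r, s) qualifies when dist(r, s) <= epsilon.
    // Euclidean distance over two coordinates is assumed here purely for
    // illustration; any metric distance function could be plugged in.
    public final class SimilarityPredicate {
        public static boolean joins(double[] r, double[] s, double epsilon) {
            double dx = r[0] - s[0];
            double dy = r[1] - s[1];
            return Math.sqrt(dx * dx + dy * dy) <= epsilon;
        }
    }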
The MRSimJoin algorithm presented in this section identifies all the
pairs (links) that belong to the result of the Similarity Join operation. In
general, the input data can be given in one or multiple distributed files.
Each input data file contains a sequence of key-value records of the form
(id, (id, elem)), where id contains two components, the id of the dataset or
relation this record belongs to (id.relID) and the id of the record within the
relation (id.uniqueKey), and elem is a latitude-longitude pair. Note that the
id component in the value of an input record is the same as the id component
in the key of that record.
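The composite key and value described above can be modelled with a small record class; the sketch below only illustrates the (id, (id, elem)) layout, and the class and field names are illustrative rather than taken from the MRSimJoin implementation.

    // Sketch of one input record: the key is (relID, uniqueKey) and the value
    // repeats that id together with the latitude-longitude element.
    public final class InputRecord {
        public static final class RecordId {
            int relID;        // which relation (R or S) the record belongs to
            long uniqueKey;   // unique id of the record within its relation
        }
        RecordId key;        // id component used as the MapReduce key
        RecordId valueId;    // the same id, stored again inside the value
        double latitude;     // elem: first coordinate
        double longitude;    // elem: second coordinate
    }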
MRSimJoin iteratively partitions the input data into smaller subsets
until each subset is small enough to be efficiently processed by a single-
node SJ routine. The overall process is divided into a sequence of rounds,
where the initial round partitions the input data while any subsequent
round divides the data of a previously generated partition. Since each round
corresponds to a MapReduce job, the input and output of each job are read
from and written to the distributed file system. The output of a round includes:
(1) the result links for the small partitions that were processed on a single
node, and (2) intermediate data for the partitions that will require further
partitioning. Note that the DFS automatically divides and distributes the
intermediate data. The execution of a single MRSimJoin round is illustrated
in Fig. 1. This figure shows that the partitioning and the generation of results
or intermediate data are performed in parallel by multiple nodes. All the
nodes partition the data using the same set of pivots, which were previously
sent to each node. MRSimJoin executes the required rounds until all the
input and intermediate data have been processed.
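The round structure can be summarised by a small driver loop. The sketch below illustrates only the control flow (one MapReduce job per round, with oversized partitions queued for later rounds); the JobRunner interface and all other names are hypothetical stand-ins, not the actual MRSimJoin interfaces.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // High-level sketch of the round-based control flow: each pending partition
    // (starting with the whole input) is either joined by the single-node SJ
    // routine inside a job, or is re-partitioned and queued for a later round.
    public class RoundDriver {
        interface JobRunner {
            // Runs one MapReduce job over the given input directory and returns
            // the output directories of partitions that still need more rounds.
            Iterable<String> runRound(String inputDir, String outputDir);
        }

        public static void run(JobRunner runner, String inputDir, String outputBase) {
            Deque<String> pending = new ArrayDeque<>();
            pending.add(inputDir);
            int round = 0;
            while (!pending.isEmpty()) {
                String current = pending.poll();
                String out = outputBase + "/round_" + (round++);
                // Result links for small partitions are written under `out`;
                // oversized partitions come back as intermediate directories.
                for (String intermediate : runner.runRound(current, out)) {
                    pending.add(intermediate);
                }
            }
        }
    }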
Data partitioning is performed using a set of K pivots, which are
a random subset of the data records to be partitioned. We use random
selection since this method was found to be efficient by Jacox and Samet.
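A minimal sketch of this pivot-based step, assuming two-dimensional records, Euclidean distance, and assignment of each record to its closest pivot, could look as follows; the handling of boundary (window) records needed for join correctness is omitted, and all names are illustrative.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Sketch of pivot-based partitioning: K pivots are drawn at random from the
    // records, and every record is assigned to the partition of its closest
    // pivot. Boundary (window) records required for join correctness are not
    // handled in this sketch.
    public class PivotPartitioner {
        public static List<double[]> pickPivots(List<double[]> records, int k, long seed) {
            List<double[]> copy = new ArrayList<>(records);
            Collections.shuffle(copy, new Random(seed));
            return copy.subList(0, Math.min(k, copy.size()));
        }

        public static int closestPivot(double[] rec, List<double[]> pivots) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < pivots.size(); i++) {
                double dx = rec[0] - pivots.get(i)[0];
                double dy = rec[1] - pivots.get(i)[1];
                double d = dx * dx + dy * dy;  // squared distance suffices for comparison
                if (d < bestDist) {
                    bestDist = d;
                    best = i;
                }
            }
            return best;
        }
    }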