functions is that all the intermediate records in a reduce node are sorted
by the intermediate key and a group is formed for each different value of
the intermediate key. Custom partitioner and comparator functions can be
provided to replace the default functions.
The input and output data are usually stored in a distributed file
system (DFS). The MapReduce framework takes care of scheduling tasks,
monitoring them, and re-executing them in case of failures.
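As a concrete illustration of the partitioning hook mentioned above, the following minimal sketch shows a custom Hadoop partitioner that routes intermediate records to reducers using only part of a composite key; the key layout and class name are hypothetical examples, not part of MRSimJoin.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Minimal sketch of a custom partitioner: intermediate records are routed
    // to reducers using only the first component of a composite key of the
    // form "partitionId|recordId". This key layout is a hypothetical example,
    // not the actual MRSimJoin key format.
    public class PrefixPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String prefix = key.toString().split("\\|", 2)[0];
            return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }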
The MRSimJoin Algorithm
The Similarity Join (SJ) operation between two datasets R and S is defined
as follows: R ⋈θε S = {〈r, s〉 | θε(r, s), r ∈ R, s ∈ S}, where θε(r, s) represents
the Similarity Join predicate, i.e., dist(r, s) ≤ ε. A sample SJ query would be
to get the pairs of restaurants (R) and movie theatres (S) that are close to
each other (within a certain distance ε).
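As a minimal sketch of the join predicate θε, assuming two-dimensional points and Euclidean distance as dist (only one possible choice of distance function), the check for a single candidate pair could look as follows; the class and method names are illustrative and not part of MRSimJoin.

    // Sketch of the Similarity Join predicate theta_epsilon(r, s):
    // a pair (r, s) qualifies when dist(r, s) <= epsilon.
    // Euclidean distance over two coordinates is assumed here purely for
    // illustration; any metric distance function could be plugged in.
    public final class SimilarityPredicate {
        public static boolean joins(double[] r, double[] s, double epsilon) {
            double dx = r[0] - s[0];
            double dy = r[1] - s[1];
            return Math.sqrt(dx * dx + dy * dy) <= epsilon;
        }
    }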
The MRSimJoin algorithm presented in this section identifies all the
pairs (links) that belong to the result of the Similarity Join operation. In
general, the input data can be given in one or multiple distributed files.
Each input data file contains a sequence of key-value records of the form
(id, (id, elem)), where id contains two components, the id of the dataset or
relation this record belongs to (id.relID) and the id of the record within the
relation (id.uniqueKey), and elem is a latitude-longitude pair. Note that the
id component in the value of an input record is the same as the id component
in the key of that record.
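The composite key and value described above can be modelled with a small record class; the sketch below only illustrates the (id, (id, elem)) layout, and the class and field names are illustrative rather than taken from the MRSimJoin implementation.

    // Sketch of one input record: the key is (relID, uniqueKey) and the value
    // repeats that id together with the latitude-longitude element.
    public final class InputRecord {
        public static final class RecordId {
            int relID;        // which relation (R or S) the record belongs to
            long uniqueKey;   // unique id of the record within its relation
        }
        RecordId key;        // id component used as the MapReduce key
        RecordId valueId;    // the same id, stored again inside the value
        double latitude;     // elem: first coordinate
        double longitude;    // elem: second coordinate
    }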
MRSimJoin iteratively partitions the input data into smaller subsets
until each subset is small enough to be efficiently processed by a single-
node SJ routine. The overall process is divided into a sequence of rounds,
where the initial round partitions the input data while any subsequent
round divides the data of a previously generated partition. Since each round
corresponds to a MapReduce job, the input and output of each job are read
from and written to the distributed file system. The output of a round includes:
(1) the result links for the small partitions that were processed on a single
node, and (2) intermediate data for the partitions that will require further
partitioning. Note that the DFS automatically divides and distributes the
intermediate data. The execution of a single MRSimJoin round is illustrated
in Fig. 1. This figure shows that the partitioning and the generation of results
or intermediate data are performed in parallel by multiple nodes. All the
nodes partition the data using the same set of pivots, which were previously
sent to each node. MRSimJoin executes the required rounds until all the
input and intermediate data have been processed.
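The round structure can be summarised by a small driver loop. The sketch below illustrates only the control flow (one MapReduce job per round, with oversized partitions queued for later rounds); the JobRunner interface and all other names are hypothetical stand-ins, not the actual MRSimJoin interfaces.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // High-level sketch of the round-based control flow: each pending partition
    // (starting with the whole input) is either joined by the single-node SJ
    // routine inside a job, or is re-partitioned and queued for a later round.
    public class RoundDriver {
        interface JobRunner {
            // Runs one MapReduce job over the given input directory and returns
            // the output directories of partitions that still need more rounds.
            Iterable<String> runRound(String inputDir, String outputDir);
        }

        public static void run(JobRunner runner, String inputDir, String outputBase) {
            Deque<String> pending = new ArrayDeque<>();
            pending.add(inputDir);
            int round = 0;
            while (!pending.isEmpty()) {
                String current = pending.poll();
                String out = outputBase + "/round_" + (round++);
                // Result links for small partitions are written under `out`;
                // oversized partitions come back as intermediate directories.
                for (String intermediate : runner.runRound(current, out)) {
                    pending.add(intermediate);
                }
            }
        }
    }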
Data partitioning is performed using a set of K pivots, which are
a random subset of the data records to be partitioned. We use random
selection since this method was found to be efficient by Jacox and Samet.
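A minimal sketch of this pivot-based step, assuming two-dimensional records, Euclidean distance, and assignment of each record to its closest pivot, could look as follows; the handling of boundary (window) records needed for join correctness is omitted, and all names are illustrative.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Sketch of pivot-based partitioning: K pivots are drawn at random from the
    // records, and every record is assigned to the partition of its closest
    // pivot. Boundary (window) records required for join correctness are not
    // handled in this sketch.
    public class PivotPartitioner {
        public static List<double[]> pickPivots(List<double[]> records, int k, long seed) {
            List<double[]> copy = new ArrayList<>(records);
            Collections.shuffle(copy, new Random(seed));
            return copy.subList(0, Math.min(k, copy.size()));
        }

        public static int closestPivot(double[] rec, List<double[]> pivots) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < pivots.size(); i++) {
                double dx = rec[0] - pivots.get(i)[0];
                double dy = rec[1] - pivots.get(i)[1];
                double d = dx * dx + dy * dy;  // squared distance suffices for comparison
                if (d < bestDist) {
                    bestDist = d;
                    best = i;
                }
            }
            return best;
        }
    }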