Similarity Join for Big Geographic Data - Geographical Information Systems: Trends and Technologies

Note that, in the case of Reduce_windowPair, all partitions that are stored for further processing are set to be repartitioned by a future window-pair partition round. This is the case because the links generated in a window-pair round, or in any of its partitions, should always be window links. In the scenario represented in Fig. 7, the MapReduce framework calls the Reduce_windowPair function for each partition of Fig. 8b: Q0, Q1, Q0_Q1{1} and Q0_Q1{2}. Observe that the value of uAttr in the output directory name is k2. This component ensures unique directory names. Assuming that the values of k2 of Q0_Q1{1} and Q0_Q1{2} belong to their bottom windows, the values of uAttr are: Q0: (Q0, -1, -1), Q1: (Q1, -1, -1), Q0_Q1{1}: (Q0, Q1, P0), and Q0_Q1{2}: (Q0, Q1, P1).

Enhancements for Geographical Distance

Since the MRSimJoin solution presented in the MRSimJoin Algorithm subsection is based on the generalized hyperplane distance, it could be used with any dataset that lies in a metric space. The solution, however, could be enhanced in cases where the distance from a record to the hyperplane between two partitions can be computed exactly (Jacox and Samet 2008). In the case of the geographical distance geoDist defined in the Geographic Data and Distance Functions subsection (Euclidean distance on a plane onto which a spherical Earth was projected), the exact distance from a record t to the hyperplane that separates the partitions of two pivots P0 and P1 is given by:

$$\mathit{hDist}(t, P_0, P_1) = \frac{\mathit{geoDist}(t, P_0)^2 - \mathit{geoDist}(t, P_1)^2}{2 \times \mathit{geoDist}(P_0, P_1)}.$$

To use this distance, the GHP distance should be replaced by hDist in line 5 of Map_base and also in line 5 of Map_windowPair.
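The following is a minimal sketch of the hDist computation, assuming records and pivots are already projected to planar (x, y) coordinates. The function names geo_dist, h_dist and ghp_dist are illustrative and do not come from the MRSimJoin pseudocode. The ghp_dist function uses the usual generalized hyperplane lower bound, (d(t, P0) − d(t, P1)) / 2, which is an assumption here because its definition is not reproduced in this section.

```python
import math


def geo_dist(a, b):
    """Euclidean distance between two projected points given as (x, y)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def h_dist(t, p0, p1):
    """Exact signed distance from t to the hyperplane between p0 and p1.

    A negative value means t lies on the p0 side, a positive value
    means it lies on the p1 side.
    """
    return (geo_dist(t, p0) ** 2 - geo_dist(t, p1) ** 2) / (2 * geo_dist(p0, p1))


def ghp_dist(t, p0, p1):
    """Generalized hyperplane lower bound (assumed definition, see lead-in)."""
    return (geo_dist(t, p0) - geo_dist(t, p1)) / 2


# Hypothetical pivots and record: the separating hyperplane is the line x = 5.
p0, p1 = (0.0, 0.0), (10.0, 0.0)
t = (3.0, 4.0)

exact = h_dist(t, p0, p1)    # t is 2 units from x = 5, on the p0 side
bound = ghp_dist(t, p0, p1)  # never larger in magnitude than the exact value
print(exact, bound)
```

Because the exact distance is never smaller in magnitude than the lower bound, using hDist can only reduce the number of records that are considered close enough to the boundary to be placed in a window.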

Implementation in Hadoop

The presented MRSimJoin algorithms are generic enough to be implemented in any MapReduce framework. This section presents a few additional guidelines for their implementation on the popular Hadoop MapReduce framework (Apache Hadoop 2013).

Distribution of atomic parameters. One of the tasks of the MRJob function, called in the main MRSimJoin routine, is to make sure that the provided atomic parameters, i.e., outDir, numPiv, eps and memT, are available at every node that will be used in the MapReduce job. In Hadoop, this can be done using the job configuration object jobConf and its methods set and get.

Distribution of pivots. MRJob also sends the list of pivots to every node that will execute a map task. In Hadoop this can be done using the
