Similarity Join for Big Geographic Data - Geographical Information Systems: Trends and Technologies

Global Positioning System Reference

In-Depth Information

to answer Similarity Joins. Unfortunately, however, there has not been

much work on the study of MapReduce-based Similarity Join techniques

for geographic data. This chapter addresses this problem by focusing on

the study, design and implementation techniques of a MapReduce-based

Similarity Join algorithm that can be used with geographic data and distance

functions. The main contributions of our work are:

• We describe MRSimJoin, an efficient MapReduce Similarity Join

algorithm. MRSimJoin extends the single-node QuickJoin algorithm

(Jacox and Samet 2008) by adapting it to the distributed MapReduce

framework (Dean and Ghemawat 2004) and integrating grouping,

sorting and parallelization techniques.

• The proposed algorithm is general enough to be used with any dataset

that lies in a metric space. Our focus is on the study of this operation

with geographic data, e.g., longitude-latitude pairs, and geographical

distance functions, e.g., Euclidean Distance on a plane where a

Spherical Earth was projected.

• We present guidelines to implement the algorithm in Hadoop, a highly

used open-source MapReduced-based system.

• We thoroughly evaluate the performance and scalability properties of

the implemented operation with synthetic and real-world geographic

datasets. We show that MRSimJoin performs signifi cantly better than

an adaptation of the state-of-the-art MapReduce Theta-Join algorithm

(Okcan and Riedewald 2011). MRSimJoin scales very well when

important parameters like epsilon, data size, and number of cluster

nodes increase.

A generic description of the MRSimJoin algorithm was presented

previously (Silva et al. 2012). This chapter, however, includes the following

contributions not included in the previous paper: (1) a focus on the

study of Similarity Joins in the context of geographical data and distance

functions, (2) completely new experimental evaluation of the performance

and scalability of MRSimJoin with geographic data (synthetic and real-

world datasets), and (3) MRSimJoin's pseudocode, algorithmic details and

descriptions not presented previously.

The remaining part of this chapter is organized as follows. First, we

present the related work. Then, the MRSimJoin algorithm is described in

detail. Next, we present the guidelines to implement MRSimJoin in Hadoop

and the performance evaluation of our implementation. Finally, we present

the conclusions and future research directions.

Search WWH ::

Custom Search

Home