Global Positioning System Reference
In-Depth Information
to answer Similarity Joins. Unfortunately, however, there has not been
much work on the study of MapReduce-based Similarity Join techniques
for geographic data. This chapter addresses this problem by focusing on
the study, design and implementation techniques of a MapReduce-based
Similarity Join algorithm that can be used with geographic data and distance
functions. The main contributions of our work are:
• We describe MRSimJoin, an efficient MapReduce Similarity Join
algorithm. MRSimJoin extends the single-node QuickJoin algorithm
(Jacox and Samet 2008) by adapting it to the distributed MapReduce
framework (Dean and Ghemawat 2004) and integrating grouping,
sorting and parallelization techniques.
• The proposed algorithm is general enough to be used with any dataset
that lies in a metric space. Our focus is on the study of this operation
with geographic data, e.g., longitude-latitude pairs, and geographical
distance functions, e.g., Euclidean Distance on a plane where a
Spherical Earth was projected.
• We present guidelines to implement the algorithm in Hadoop, a highly
used open-source MapReduced-based system.
• We thoroughly evaluate the performance and scalability properties of
the implemented operation with synthetic and real-world geographic
datasets. We show that MRSimJoin performs signifi cantly better than
an adaptation of the state-of-the-art MapReduce Theta-Join algorithm
(Okcan and Riedewald 2011). MRSimJoin scales very well when
important parameters like epsilon, data size, and number of cluster
nodes increase.
A generic description of the MRSimJoin algorithm was presented
previously (Silva et al. 2012). This chapter, however, includes the following
contributions not included in the previous paper: (1) a focus on the
study of Similarity Joins in the context of geographical data and distance
functions, (2) completely new experimental evaluation of the performance
and scalability of MRSimJoin with geographic data (synthetic and real-
world datasets), and (3) MRSimJoin's pseudocode, algorithmic details and
descriptions not presented previously.
The remaining part of this chapter is organized as follows. First, we
present the related work. Then, the MRSimJoin algorithm is described in
detail. Next, we present the guidelines to implement MRSimJoin in Hadoop
and the performance evaluation of our implementation. Finally, we present
the conclusions and future research directions.
Search WWH ::




Custom Search