DistributedCache, a facility that allows the efficient distribution of application-specific, large, read-only files.
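As an illustration only, the following is a minimal sketch of how a read-only file can be registered with Hadoop 0.20's DistributedCache at job-setup time; the class name and the file path "/shared/pivots.txt" are hypothetical and are not taken from the MRSimJoin implementation.

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheFileSetup {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheFileSetup.class);
        // Register a (hypothetical) read-only file with the job. Hadoop
        // copies it once to every task node before the tasks start there.
        DistributedCache.addCacheFile(new URI("/shared/pivots.txt"), conf);
        // Inside a Mapper or Reducer, the local copies can then be located
        // with DistributedCache.getLocalCacheFiles(jobConf).
    }
}
```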
Renaming directories. The main MRSimJoin routine renames a directory to flag it as already processed. This can be done using the rename method of Hadoop's FileSystem class. The method will change the directory path in Hadoop's distributed file system without physically moving its data.
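A minimal sketch of such a rename follows; the directory names are illustrative and do not reflect the naming convention actually used by MRSimJoin.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MarkProcessed {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Illustrative paths: rename only updates HDFS metadata, so the
        // directory's blocks are not copied or moved on disk.
        Path pending   = new Path("/simjoin/round_3");
        Path processed = new Path("/simjoin/round_3_done");
        boolean ok = fs.rename(pending, processed);
        System.out.println("Renamed: " + ok);
    }
}
```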
Single-node Similarity Join. InMemorySimJoin and InMemorySimJoin-Win are single-node algorithms that compute the links and window links, respectively, in a given dataset. We have implemented these functions using the Quickjoin algorithm (Jacox and Samet 2008).
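Quickjoin itself recursively partitions the data around pivots before falling back to pairwise comparisons; the sketch below is only a naive quadratic stand-in, with hypothetical class and method names, meant to illustrate what InMemorySimJoin computes: every pair of records within distance eps of each other (the links).

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSimJoin {
    // A record is just a (longitude, latitude) pair here.
    static class Point {
        final double lon, lat;
        Point(double lon, double lat) { this.lon = lon; this.lat = lat; }
        double distanceTo(Point o) {
            double dx = lon - o.lon, dy = lat - o.lat;
            return Math.sqrt(dx * dx + dy * dy);
        }
    }

    // Returns every pair (i, j), i < j, at distance <= eps: the "links".
    static List<int[]> links(List<Point> data, double eps) {
        List<int[]> result = new ArrayList<int[]>();
        for (int i = 0; i < data.size(); i++) {
            for (int j = i + 1; j < data.size(); j++) {
                if (data.get(i).distanceTo(data.get(j)) <= eps) {
                    result.add(new int[] { i, j });
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Point> data = new ArrayList<Point>();
        data.add(new Point(112.07, 33.45));
        data.add(new Point(112.06, 33.46));
        data.add(new Point(111.90, 33.60));
        System.out.println(links(data, 0.05).size() + " link(s)");
    }
}
```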
Performance Evaluation
We implemented MRSimJoin using the Hadoop 0.20.2 MapReduce
framework. In this section we evaluate its performance with synthetic and
real-world geographic data.
Test Configuration
We performed the experiments using a Hadoop cluster running on the
Amazon Elastic Compute Cloud (Amazon EC2). Unless otherwise stated,
we used a cluster of 10 nodes (1 master + 9 worker nodes) with the following
specifications: 15 GB of memory, 4 virtual cores with 2 EC2 Compute Units each, 1,690 GB of local instance storage, and a 64-bit platform. We set the block size of the distributed file system to 64 MB and the total number of reducers to 0.95 × (number of worker nodes) × (maximum reduce tasks per node). We use the following datasets:
• SynthData. This is a synthetic geographic dataset (longitude-latitude pairs). The dataset for scale factor 1 (SF1) contains 2 million records (86.9 MB). The latitude and longitude values of the SF1 dataset lie in the ranges [25, 50] and [65, 125], respectively.
• GeoNames. This dataset contains longitude-latitude pairs extracted from the GeoNames database (GeoNames 2013). The records represent the locations of various US geographical features. The SF1 dataset contains 2,023,687 records (52.1 MB) with latitude and longitude ranges of [25, 50] and [65, 125], respectively.
The datasets for SF greater than 1 were generated in such a way that the number of links of any SJ operation in SF N is N times the number of links of the same operation in SF1. Specifically, the datasets for higher SFs were obtained by adding shifted copies of the SF1 dataset such that the separation between the region of new records and the region of previous records is greater than the largest distance threshold (epsilon) used in the experiments, so no links are formed across copies.
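A minimal sketch of this scaling scheme is given below, assuming a plain text file of longitude-latitude records and an illustrative shift amount; the actual generator used for the experiments is not shown here.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;

public class ScaleDataset {
    // Writes SF copies of an SF1 dataset, each copy shifted in longitude so
    // that the gap between copies exceeds any epsilon used in the tests and
    // therefore no links are created between different copies.
    public static void main(String[] args) throws Exception {
        String inFile = args[0];   // SF1 dataset: "lon,lat" per line (hypothetical format)
        String outFile = args[1];
        int sf = Integer.parseInt(args[2]);
        double shift = 100.0;      // illustrative; must leave a gap larger than any epsilon

        PrintWriter out = new PrintWriter(new FileWriter(outFile));
        for (int copy = 0; copy < sf; copy++) {
            BufferedReader in = new BufferedReader(new FileReader(inFile));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",");
                double lon = Double.parseDouble(parts[0]) + copy * shift;
                double lat = Double.parseDouble(parts[1]);
                out.println(lon + "," + lat);
            }
            in.close();
        }
        out.close();
    }
}
```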