DistributedCache, a facility that allows the efficient distribution of application-specific, large, read-only files.
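As an illustration only, the following is a minimal sketch of how a read-only file can be registered with Hadoop 0.20's DistributedCache at job-setup time; the class name and the file path "/shared/pivots.txt" are hypothetical and are not taken from the MRSimJoin implementation.

```java
import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.mapred.JobConf;

public class CacheFileSetup {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(CacheFileSetup.class);
        // Register a (hypothetical) read-only file with the job. Hadoop
        // copies it once to every task node before the tasks start there.
        DistributedCache.addCacheFile(new URI("/shared/pivots.txt"), conf);
        // Inside a Mapper or Reducer, the local copies can then be located
        // with DistributedCache.getLocalCacheFiles(jobConf).
    }
}
```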
Renaming directories. The main MRSimJoin routine renames a directory to flag it as already processed. This can be done using the rename method of Hadoop's FileSystem class. The method will change the directory path in Hadoop's distributed file system without physically moving its data.
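A minimal sketch of such a rename follows; the directory names are illustrative and do not reflect the naming convention actually used by MRSimJoin.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MarkProcessed {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Illustrative paths: rename only updates HDFS metadata, so the
        // directory's blocks are not copied or moved on disk.
        Path pending   = new Path("/simjoin/round_3");
        Path processed = new Path("/simjoin/round_3_done");
        boolean ok = fs.rename(pending, processed);
        System.out.println("Renamed: " + ok);
    }
}
```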
Single-node Similarity Join. InMemorySimJoin and InMemorySimJoin-Win are single-node algorithms that compute the links and window links, respectively, in a given dataset. We have implemented these functions using the Quickjoin algorithm (Jacox and Samet 2008).
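Quickjoin itself recursively partitions the data around pivots before falling back to pairwise comparisons; the sketch below is only a naive quadratic stand-in, with hypothetical class and method names, meant to illustrate what InMemorySimJoin computes: every pair of records within distance eps of each other (the links).

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSimJoin {
    // A record is just a (longitude, latitude) pair here.
    static class Point {
        final double lon, lat;
        Point(double lon, double lat) { this.lon = lon; this.lat = lat; }
        double distanceTo(Point o) {
            double dx = lon - o.lon, dy = lat - o.lat;
            return Math.sqrt(dx * dx + dy * dy);
        }
    }

    // Returns every pair (i, j), i < j, at distance <= eps: the "links".
    static List<int[]> links(List<Point> data, double eps) {
        List<int[]> result = new ArrayList<int[]>();
        for (int i = 0; i < data.size(); i++) {
            for (int j = i + 1; j < data.size(); j++) {
                if (data.get(i).distanceTo(data.get(j)) <= eps) {
                    result.add(new int[] { i, j });
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Point> data = new ArrayList<Point>();
        data.add(new Point(112.07, 33.45));
        data.add(new Point(112.06, 33.46));
        data.add(new Point(111.90, 33.60));
        System.out.println(links(data, 0.05).size() + " link(s)");
    }
}
```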
Performance Evaluation
We implemented MRSimJoin using the Hadoop 0.20.2 MapReduce
framework. In this section we evaluate its performance with synthetic and
real-world geographic data.
Test Configuration
We performed the experiments using a Hadoop cluster running on the
Amazon Elastic Compute Cloud (Amazon EC2). Unless otherwise stated,
we used a cluster of 10 nodes (1 master + 9 worker nodes) with the following
specifications: 15 GB of memory, 4 virtual cores with 2 EC2 Compute Units each, 1,690 GB of local instance storage, and a 64-bit platform. We set the block size of the distributed file system to 64 MB and the total number of reducers to 0.95 × (number of worker nodes) × (maximum reduce tasks per node). We use the following datasets:
• SynthData. This is a synthetic geographic dataset (longitude-latitude pairs). The dataset for scale factor 1 (SF1) contains 2 million records (86.9 MB). The latitude and longitude values of the SF1 dataset lie in the ranges [25, 50] and [65, 125], respectively.
• GeoNames. This dataset contains longitude-latitude pairs extracted from the GeoNames database (GeoNames 2013). The records represent the locations of various US geographical features. The SF1 dataset contains 2,023,687 records (52.1 MB) with latitude and longitude ranges of [25, 50] and [65, 125], respectively.
The datasets for SF greater than 1 were generated in such a way that the number of links of any SJ operation in SF N is N times the number of links of the same operation in SF1. Specifically, the datasets for higher SFs were obtained by adding shifted copies of the SF1 dataset such that the separation between the region of new records and the region of previous records is greater than the largest distance threshold (epsilon) used in the experiments, so no links are formed across copies.
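A minimal sketch of this scaling scheme is given below, assuming a plain text file of longitude-latitude records and an illustrative shift amount; the actual generator used for the experiments is not shown here.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.PrintWriter;

public class ScaleDataset {
    // Writes SF copies of an SF1 dataset, each copy shifted in longitude so
    // that the gap between copies exceeds any epsilon used in the tests and
    // therefore no links are created between different copies.
    public static void main(String[] args) throws Exception {
        String inFile = args[0];   // SF1 dataset: "lon,lat" per line (hypothetical format)
        String outFile = args[1];
        int sf = Integer.parseInt(args[2]);
        double shift = 100.0;      // illustrative; must leave a gap larger than any epsilon

        PrintWriter out = new PrintWriter(new FileWriter(outFile));
        for (int copy = 0; copy < sf; copy++) {
            BufferedReader in = new BufferedReader(new FileReader(inFile));
            String line;
            while ((line = in.readLine()) != null) {
                String[] parts = line.split(",");
                double lon = Double.parseDouble(parts[0]) + copy * shift;
                double lat = Double.parseDouble(parts[1]);
                out.println(lon + "," + lat);
            }
            in.close();
        }
        out.close();
    }
}
```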