CHAPTER 2
Similarity Join for Big Geographic Data
Yasin N. Silva,* Jason M. Reed, Lisa M. Tsosie and Timothy A. Matti
Introduction
Similarity Join is one of the most useful data processing and analysis
operations for geographic data. It retrieves all data pairs whose distances are
smaller than a predefined threshold ε. Multiple application scenarios need
to perform this operation over large amounts of data. Internet companies,
for instance, collect massive amounts of information on their customers such
as their geographic location and interests. They can use similarity queries
to provide enhanced services to their customers; for example, a movie
theatre website could recommend neighboring theatres and restaurants
in the customer's town. MapReduce, a framework for processing very
large datasets using large computer clusters, constitutes an answer to the
requirements of processing massive amounts of data in a highly scalable
and distributed fashion (Dean and Ghemawat 2004). MapReduce-based
systems are composed of large clusters of commodity machines and are often
dynamically scalable, i.e., cluster nodes can be added or removed based
on the workload. The MapReduce framework quickly processes massive
datasets by splitting them into independent chunks that are processed in
a highly parallel fashion.
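The definition above can be illustrated with a minimal single-machine sketch. It is not the chapter's algorithm: it assumes points are planar (x, y) coordinates compared with Euclidean distance, and the function name similarity_join and the sample data are hypothetical.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def similarity_join(points, eps):
    """Return every pair of points whose distance is smaller than eps.

    Assumption: points are (x, y) tuples in a planar coordinate system.
    """
    return [(p, q) for p, q in combinations(points, 2) if dist(p, q) < eps]


# Hypothetical sample data: two tight clusters far apart from each other.
points = [(0.0, 0.0), (0.5, 0.5), (3.0, 4.0), (3.2, 4.1)]
pairs = similarity_join(points, eps=1.0)
print(pairs)
```

This sketch compares every pair of points, so its work grows quadratically with the size of the input, which is why very large datasets call for the distributed processing described above.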
Multiple Similarity Join algorithms and implementation techniques
have been proposed. They range from approaches designed only for internal-memory
or external-memory data to techniques that make use of database operators
* Arizona State University, 4701 W. Thunderbird Road, Glendale, AZ 85306, USA.
 