Similarity Join for Big Geographic Data - Geographical Information Systems: Trends and Technologies

CHAPTER 2

Similarity Join for Big Geographic Data

Yasin N. Silva,* Jason M. Reed, Lisa M. Tsosie and Timothy A. Matti

Introduction

Similarity Join is one of the most useful data processing and analysis operations for geographic data. It retrieves all data pairs whose distances are smaller than a predefined threshold ε. Many application scenarios need to perform this operation over large amounts of data. Internet companies, for instance, collect massive amounts of information on their customers, such as their geographic location and interests. They can use similarity queries to provide enhanced services to their customers; for example, a movie theatre website could recommend neighboring theatres and restaurants in the customer's town. MapReduce, a framework for processing very large datasets using large computer clusters, addresses the need to process massive amounts of data in a highly scalable and distributed fashion (Dean and Ghemawat 2004). MapReduce-based systems are composed of large clusters of commodity machines and are often dynamically scalable, i.e., cluster nodes can be added or removed based on the workload. The MapReduce framework quickly processes massive datasets by splitting them into independent chunks that are processed in a highly parallel fashion.
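To make the definition of the operation concrete, the following is a minimal single-machine sketch of a similarity self-join over a small set of points. It is an illustration only, not the chapter's algorithm: the function name `similarity_join`, the sample coordinates, and the use of Euclidean distance on plain (x, y) pairs are all assumptions made here, since the text does not fix a particular distance function at this point.

```python
from itertools import combinations
from math import dist  # Euclidean distance, available in Python 3.8+


def similarity_join(points, eps):
    """Return every pair of points whose distance is smaller than eps.

    Hypothetical helper for illustration: it compares all pairs directly,
    which costs O(n^2) distance computations.
    """
    return [(p, q) for p, q in combinations(points, 2) if dist(p, q) < eps]


# Assumed sample data: four locations as planar (x, y) coordinates.
places = [(0.0, 0.0), (0.5, 0.5), (3.0, 4.0), (3.2, 4.1)]

pairs = similarity_join(places, eps=1.0)
for p, q in pairs:
    print(p, q)
```

With ε = 1.0 the sketch reports the two pairs of nearby points and skips the pairs that are far apart. Comparing every pair is only practical for small inputs, which is the motivation for distributing the work over many machines.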

Multiple Similarity Join algorithms and implementation techniques have been proposed. They range from approaches for only internal memory or external memory data to techniques that make use of database operators

Arizona State University, 4701 W. Thunderbird Road, Glendale, AZ 85306, USA.
