Similarity Join for Big Geographic Data - Geographical Information Systems: Trends and Technologies


where $\Delta\varphi = \varphi_2 - \varphi_1$, $\Delta\lambda = \lambda_2 - \lambda_1$, $\varphi_m = \frac{\varphi_1 + \varphi_2}{2}$, R is the radius of the earth, $\varphi$ and $\lambda$ are in radians, and geoDist is in the same unit as R.
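The quantities defined above (a latitude difference, a longitude difference, and a mean latitude) are the inputs of the equirectangular distance approximation. Assuming that this is the formula geoDist refers to (the equation itself is not reproduced in this excerpt), a minimal Python sketch might look as follows. The function name `geo_dist` and the 6371 km default radius are illustrative choices, not taken from the text.

```python
import math

# Mean Earth radius in kilometres; an assumed default, not a value from the text.
EARTH_RADIUS_KM = 6371.0


def geo_dist(lat1, lon1, lat2, lon2, radius=EARTH_RADIUS_KM):
    """Equirectangular approximation of the distance between two points.

    Assumption: this is the formula the 'where' clause belongs to.
    All four angles are in radians; the result is in the unit of `radius`.
    """
    d_phi = lat2 - lat1           # latitude difference
    d_lambda = lon2 - lon1        # longitude difference
    phi_m = (lat1 + lat2) / 2.0   # mean latitude
    return radius * math.sqrt(d_phi ** 2 + (math.cos(phi_m) * d_lambda) ** 2)
```

Passing `radius=1.0` returns the distance as an angle in radians, which can be convenient when a distance threshold is already expressed that way.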

A Quick Introduction to MapReduce

MapReduce is one of the main software frameworks for distributed processing (Dean and Ghemawat 2004). The framework can process massive amounts of data and works by dividing the processing task into two phases, map and reduce, for which the user provides two functions named map and reduce. These functions take key-value pairs as input and produce key-value pairs as output, in the following general form:

map: (k1, v1) → list(k2, v2)

reduce: (k2, list(v2)) → list(k3, v3)

Note that the input and output types of each function can be different. However, the input of the reduce function should use the same types as the output of the map function.
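As an illustration of these two signatures, here is a minimal Python sketch of the classic word-count job. The names `map_fn` and `reduce_fn` and the choice of a line offset as the input key are assumptions made for this example, not part of any particular framework's API.

```python
from typing import Iterable, Iterator, Tuple


def map_fn(key: int, value: str) -> Iterator[Tuple[str, int]]:
    """map: (k1, v1) -> list(k2, v2).

    k1 is a line offset (unused here), v1 is the line text.
    Emits one (word, 1) pair per word in the line.
    """
    for word in value.split():
        yield word.lower(), 1


def reduce_fn(key: str, values: Iterable[int]) -> Iterator[Tuple[str, int]]:
    """reduce: (k2, list(v2)) -> list(k3, v3).

    k2 is a word, list(v2) holds all the counts emitted for that word.
    """
    yield key, sum(values)
```

Here the map input types (int, str) differ from its output types (str, int), while the reduce input types match the map output types, as the text requires.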

The execution of a MapReduce job works as follows. The framework splits the input dataset into independent data chunks that are processed in parallel by multiple independent map tasks. Each map call is given a pair (k1, v1) and produces a list of (k2, v2) pairs. The output of the map calls is known as the intermediate output. The intermediate data is transferred to the reduce nodes by a process known as the shuffle. Each reduce node is assigned a different subset of the intermediate key space; these subsets are referred to as partitions. The framework guarantees that all intermediate records with the same intermediate key (k2) are sent to the same reduce node. At each reduce node, all received intermediate records are sorted and grouped. Each group is processed in a single reduce call. Multiple reduce tasks are also executed in parallel. Each reduce call receives a pair (k2, list(v2)) and produces a list of (k3, v3) pairs as output.
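The flow described above can be imitated in a single process. The following Python sketch is a toy simulation, not a distributed implementation: `run_job`, `cell_map`, and `cell_reduce` are hypothetical names, and the example job (counting points per one-degree grid cell) is chosen only for illustration.

```python
import math
from collections import defaultdict
from itertools import groupby
from operator import itemgetter


def cell_map(point_id, coords):
    """map: (k1, v1) -> list(k2, v2); emits (grid cell, 1) for one point."""
    lat, lon = coords
    return [((math.floor(lat), math.floor(lon)), 1)]


def cell_reduce(cell, counts):
    """reduce: (k2, list(v2)) -> list(k3, v3); sums the counts of one cell."""
    return [(cell, sum(counts))]


def run_job(records, map_fn, reduce_fn, num_reducers=2):
    # Map phase: every input pair yields a list of intermediate pairs.
    intermediate = []
    for k1, v1 in records:
        intermediate.extend(map_fn(k1, v1))

    # Shuffle: route each intermediate pair to a partition by key hash, so
    # equal keys always land on the same (simulated) reduce node.
    partitions = defaultdict(list)
    for k2, v2 in intermediate:
        partitions[hash(k2) % num_reducers].append((k2, v2))

    # Reduce phase: on each node, sort by key, group equal keys, and make
    # one reduce call per group.
    output = []
    for node in sorted(partitions):
        ordered = sorted(partitions[node], key=itemgetter(0))
        for k2, group in groupby(ordered, key=itemgetter(0)):
            output.extend(reduce_fn(k2, [v2 for _, v2 in group]))
    return output


points = [(1, (10.2, 20.5)), (2, (10.7, 20.1)), (3, (-3.4, 5.0))]
print(sorted(run_job(points, cell_map, cell_reduce)))
# prints [((-4, 5), 1), ((10, 20), 2)]
```

In a real framework the map calls and the reduce calls run on different machines; the loops here only preserve the order of the phases and the guarantee that one key is handled by one reduce call.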

The processes of transferring the map outputs to the reduce nodes, sorting the records at each destination node, and grouping these records are driven by the partition, sortCompare, and groupCompare functions, respectively. These functions have the following form:

partition: k2 → partitionNumber

sortCompare: (k2_1, k2_2) → {-1, 0, 1}

groupCompare: (k2_1, k2_2) → {-1, 0, 1}
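A minimal Python sketch of these three functions follows, assuming hash-based partitioning and natural key ordering. The function names, the use of CRC32 as the hash, and the helper `group_sorted` are choices made for this illustration and do not reproduce any framework's actual code.

```python
import zlib
from functools import cmp_to_key


def partition(k2: str, num_partitions: int) -> int:
    """partition: k2 -> partitionNumber, derived from a hash of the key."""
    return zlib.crc32(k2.encode("utf-8")) % num_partitions


def sort_compare(k2_1, k2_2) -> int:
    """sortCompare: returns -1, 0, or 1 using the keys' natural order."""
    return (k2_1 > k2_2) - (k2_1 < k2_2)


def group_compare(k2_1, k2_2) -> int:
    """groupCompare: two keys fall in the same group when this returns 0."""
    return sort_compare(k2_1, k2_2)


def group_sorted(records, sort_cmp, group_cmp):
    """Sort (k2, v2) records with sort_cmp, then merge neighbours that
    group_cmp reports as equal into (k2, list(v2)) groups."""
    ordered = sorted(records, key=cmp_to_key(lambda a, b: sort_cmp(a[0], b[0])))
    groups = []
    for k2, v2 in ordered:
        if groups and group_cmp(groups[-1][0], k2) == 0:
            groups[-1][1].append(v2)
        else:
            groups.append((k2, [v2]))
    return groups
```

Keeping the sort and group comparators separate means a job could supply a group comparator that inspects only part of a composite key while the sort comparator still orders records by the full key.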

The default implementation of the partition function receives an intermediate key (k2) as input and generates a partition number based on a hash value for k2. The default sortCompare and groupCompare functions directly compare two intermediate keys (k2_1, k2_2) and return -1 (k2_1 < k2_2), 0 (k2_1 = k2_2), or +1 (k2_1 > k2_2). The result of using the default comparator
