Similarity Join for Big Geographic Data - Geographical Information Systems: Trends and Technologies - page 31

Global Positioning System Reference

In-Depth Information

The partition is small enough to be solved in a

single node. Results written to final output in HDFS

(Hadoop Distributed File System ).

The partition will need to be further re -partitioned

in additional MapReduce rounds. Intermediate

data is written to HDFS.

Single-node

Distributed

Input Data

MR 1

P1

P2

P3

MR 2

MR 3

MR 4

P4

P5

P6

P7

P8

P9

P10

P11

P12

MR 5

MR 6

P13

P14

P15

P16

P17

P18

Fig. 4. Example of MRSimJoin rounds and partitions.

Base Partition Round

A base partition round processes the initial input data or a base partition

previously generated by a base partition round. The goal of a base partition

MapReduce job is to partition its input data and produce: the result links

for partitions that are small enough to be processed in a single node, and

intermediate data for partitions that require further processing. The main

routine sets up each base partition MapReduce job to use Map_base and

Reduce_base as the map and reduce functions, respectively. Additionally,

the default partition function is replaced by Partition_base and the default

sortCompare and groupCompare functions are replaced by Compare_base .

Map_base , the map function for the base partition rounds, is presented in

Algorithm 2. The format of the input key-value pairs, i.e., k 1 , v1 , is: k 1 = id,

v 1 = ( id, elem ), and the format of the intermediate key-value pairs, i.e., k 2 , v 2,

is: k 2 = ( part,win ) , v 2 = ( id, elem, part ), where part identifi es the base partition

of the intermediate record and win the window partition of the record. We

use the value -1 when a given parameter is not applicable or will not be

needed in the future. The MapReduce framework divides the job input

data into chunks and creates map tasks in multiple nodes to process them.

Each map task is called multiple times and each call executes the Map_base

function for a given record ( id, ( id, elem )) of the input data. The Map_base

function identifi es the closest pivot p to elem (line 1). The function then

outputs one intermediate key-value pair of the form (( p,− 1) , ( id, elem,− 1))

for the base partition that elem belongs to (line 2) and one pair of the form

(( p, i ) , ( id, elem, p )) for each window-pair partition (corresponding to pivots

Next Page

Geographical Information Systems: Trends and Technologies

Search WWH ::

Custom Search

Home