Similarity Join for Big Geographic Data - Geographical Information Systems: Trends and Technologies

Global Positioning System Reference

In-Depth Information

found processing the original base partitions P 0 and P 1, respectively; and

the window links in { E , D } and { C , F } are found in partitions Q 0 and Q 1,

respectively. The examples in Figs. 2 and 3 show that base and window-pair

partitions need to be processed differently. MRSimJoin should output links

in a base partition and window-links in a window-pair partition. Also, the

re-partitioning of the base and window-pair partitions generate different

numbers of new window-pair partitions.

The Main Algorithm

The main routine of MRSimJoin is presented in Algorithm 1. The routine

uses an intermediate directory (line 1) to store the partitions that will need

further repartitioning. The names of intermediate directories that store base

partitions have the following format:

⟨ outDir ⟩ /intermediate/B_ ⟨ roundNum ⟩ _ ⟨ p ⟩

The names of intermediate directories storing window-pair partitions

have the following format:

⟨ outDir ⟩ /intermediate/W_ ⟨ roundNum ⟩ _[ ⟨ uAttr ⟩ ]_ ⟨ p1 ⟩ _ ⟨ p2 ⟩

In these expressions, p , p 1 and p 2 are pivot indices. uAttr is an attribute

that ensures unique directory names for the partitions generated in a round

and is needed only when a window-pair is repartitioned. The details of

how uAttr is specifi ed are presented in the Window-pair Partition Round

subsection.

Each iteration of the while loop (lines 3 to 22) corresponds to one

round and executes a MapReduce job. In each round, the initial input

data or a previously generated partition is repartitioned. The Similarity

Join output of a partition small enough to be processed in a single node

is obtained by running a single-node SJ algorithm. Larger partitions are

stored as intermediate data for further processing. For each round, the main

routine sets the values of the job input directory (lines 4 to 8) and randomly

selects numPivots pivots from this directory (line 12). Note that function

GetNextIntermPartitionDir returns the directory of one of the intermediate

partitions that needs further processing. Next, the routine executes a base

partition job (line 14) or a window-pair partition job (line 16) based on the

type of the job input directory. The routine MRJob sets up a MapReduce

job that will use the provided map , reduce, partition and compare functions.

The partition function will be used to replace the default partition function.

The compare function will be used to replace the default sortCompare and

groupCompare functions. MRJob also makes sure that the provided atomic

parameters, i.e., outDir , numPiv , eps and memT , are available at every node

that will be used in the MapReduce job and that the pivots are available at

each node that will execute map tasks. If a round is processing a previously

Search WWH ::

Custom Search

Home