Global Positioning System Reference
In-Depth Information
found processing the original base partitions P 0 and P 1, respectively; and
the window links in { E , D } and { C , F } are found in partitions Q 0 and Q 1,
respectively. The examples in Figs. 2 and 3 show that base and window-pair
partitions need to be processed differently. MRSimJoin should output links
in a base partition and window-links in a window-pair partition. Also, the
re-partitioning of the base and window-pair partitions generate different
numbers of new window-pair partitions.
The Main Algorithm
The main routine of MRSimJoin is presented in Algorithm 1. The routine
uses an intermediate directory (line 1) to store the partitions that will need
further repartitioning. The names of intermediate directories that store base
partitions have the following format:
outDir /intermediate/B_ roundNum _ p
The names of intermediate directories storing window-pair partitions
have the following format:
outDir /intermediate/W_ roundNum _[ uAttr ]_ p1 _ p2
In these expressions, p , p 1 and p 2 are pivot indices. uAttr is an attribute
that ensures unique directory names for the partitions generated in a round
and is needed only when a window-pair is repartitioned. The details of
how uAttr is specifi ed are presented in the Window-pair Partition Round
subsection.
Each iteration of the while loop (lines 3 to 22) corresponds to one
round and executes a MapReduce job. In each round, the initial input
data or a previously generated partition is repartitioned. The Similarity
Join output of a partition small enough to be processed in a single node
is obtained by running a single-node SJ algorithm. Larger partitions are
stored as intermediate data for further processing. For each round, the main
routine sets the values of the job input directory (lines 4 to 8) and randomly
selects numPivots pivots from this directory (line 12). Note that function
GetNextIntermPartitionDir returns the directory of one of the intermediate
partitions that needs further processing. Next, the routine executes a base
partition job (line 14) or a window-pair partition job (line 16) based on the
type of the job input directory. The routine MRJob sets up a MapReduce
job that will use the provided map , reduce, partition and compare functions.
The partition function will be used to replace the default partition function.
The compare function will be used to replace the default sortCompare and
groupCompare functions. MRJob also makes sure that the provided atomic
parameters, i.e., outDir , numPiv , eps and memT , are available at every node
that will be used in the MapReduce job and that the pivots are available at
each node that will execute map tasks. If a round is processing a previously
Search WWH ::




Custom Search