Global Positioning System Reference
In-Depth Information
found processing the original base partitions
P
0 and
P
1, respectively; and
the window links in {
E
,
D
} and {
C
,
F
} are found in partitions
Q
0 and
Q
1,
respectively. The examples in Figs. 2 and 3 show that base and window-pair
partitions need to be processed differently. MRSimJoin should output links
in a base partition and window-links in a window-pair partition. Also, the
re-partitioning of the base and window-pair partitions generate different
numbers of new window-pair partitions.
The Main Algorithm
The main routine of MRSimJoin is presented in Algorithm 1. The routine
uses an intermediate directory (line 1) to store the partitions that will need
further repartitioning. The names of intermediate directories that store base
partitions have the following format:
⟨
outDir
⟩
/intermediate/B_
⟨
roundNum
⟩
_
⟨
p
⟩
The names of intermediate directories storing window-pair partitions
have the following format:
⟨
outDir
⟩
/intermediate/W_
⟨
roundNum
⟩
_[
⟨
uAttr
⟩
]_
⟨
p1
⟩
_
⟨
p2
⟩
In these expressions,
p
,
p
1 and
p
2 are pivot indices.
uAttr
is an attribute
that ensures unique directory names for the partitions generated in a round
and is needed only when a window-pair is repartitioned. The details of
how
uAttr
is specifi ed are presented in the Window-pair Partition Round
subsection.
Each iteration of the while loop (lines 3 to 22) corresponds to one
round and executes a MapReduce job. In each round, the initial input
data or a previously generated partition is repartitioned. The Similarity
Join output of a partition small enough to be processed in a single node
is obtained by running a single-node SJ algorithm. Larger partitions are
stored as intermediate data for further processing. For each round, the main
routine sets the values of the job input directory (lines 4 to 8) and randomly
selects
numPivots
pivots from this directory (line 12). Note that function
GetNextIntermPartitionDir
returns the directory of one of the intermediate
partitions that needs further processing. Next, the routine executes a base
partition job (line 14) or a window-pair partition job (line 16) based on the
type of the job input directory. The routine
MRJob
sets up a MapReduce
job that will use the provided
map
,
reduce, partition
and
compare
functions.
The
partition
function will be used to replace the default
partition
function.
The
compare
function will be used to replace the default
sortCompare
and
groupCompare
functions.
MRJob
also makes sure that the provided atomic
parameters, i.e.,
outDir
,
numPiv
,
eps
and
memT
, are available at every node
that will be used in the MapReduce job and that the
pivots
are available at
each node that will execute
map
tasks. If a round is processing a previously
Search WWH ::
Custom Search