similarity directly, when only a small number of candidates remains.
A version of the algorithm that performs random accesses only for the
first k candidates is shown in Algorithm 6.2.1.
We can straightforwardly adapt the heaviest first top-k algorithm
for all normalized similarity measures. The benefit of the heaviest
first algorithm over the nra algorithms is that, for disk-based storage
devices, it takes advantage of sequential accesses, one list at a time.
6.3 All-Match Join Queries
An all-match join query between two datasets S, R returns all pairs s, r
in the cross-product S × R with similarity larger than a user-specified
threshold θ.
Join queries can be answered using a block nested loop join algorithm.
Given two string datasets S and R, the algorithm scans S until main
memory is full and subsequently scans R, identifying all pairs of
strings s, r with similarity exceeding the query threshold. The obvious
drawback of this algorithm is that it needs to scan R once for every
main-memory partition of S produced.
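As a minimal sketch of this strategy, the following code loads S one main-memory block at a time and rescans R for each block. The function name, the `similarity` callable, and the `block_size` parameter are illustrative assumptions, not part of the original text.

```python
def block_nested_loop_join(S, R, similarity, theta, block_size=10_000):
    """Return all pairs (s, r) from S x R with similarity(s, r) > theta."""
    results = []
    # Load S one main-memory block at a time ...
    for start in range(0, len(S), block_size):
        block = S[start:start + block_size]
        # ... and scan R once per block of S (the drawback noted above).
        for r in R:
            for s in block:
                if similarity(s, r) > theta:
                    results.append((s, r))
    return results
```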
In certain cases it is more efficient to use inverted indexes to per-
form the join. If neither dataset is indexed, we can index either dataset
and subsequently scan the un-indexed dataset, performing one selec-
tion query per string therein. If both inverted indexes can fit in main
memory, the largest dataset is indexed. If the inverted index of only one
dataset can fit in main memory, that dataset is indexed. If neither can fit
in main memory, the largest dataset is partitioned such that the inverted
index of each partition can fit in main memory. Then, the un-indexed
dataset is scanned as many times as the number of partitions cre-
ated. Notice that in order to create the inverted indexes, one of the
datasets needs to be sorted first (depending on the type of index used,
records will have to be sorted either in string identifier order or in
Lp-norm order).
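The sketch below illustrates the index-based strategy under simplifying assumptions: strings are represented as token sets, the inverted index is built in memory on one dataset, and each string of the other dataset issues one selection query by merging the relevant postings lists and verifying the candidates exactly. The helper names (`tokenize`, `similarity`) are assumptions made for the example.

```python
from collections import defaultdict

def build_inverted_index(dataset, tokenize):
    """Map each token to the list of string ids that contain it."""
    index = defaultdict(list)
    for sid, s in enumerate(dataset):
        for token in set(tokenize(s)):
            index[token].append(sid)
    return index

def index_join(indexed, probing, tokenize, similarity, theta):
    index = build_inverted_index(indexed, tokenize)
    results = []
    for r in probing:                      # scan the un-indexed dataset
        candidates = set()
        for token in set(tokenize(r)):     # one selection query per string:
            candidates.update(index.get(token, ()))  # merge postings lists
        for sid in candidates:             # verify each candidate exactly
            if similarity(indexed[sid], r) > theta:
                results.append((indexed[sid], r))
    return results
```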
A simple partitioning strategy is based on sorting both datasets
S and R in increasing order of their Lp-norms, and using norm filtering
to reduce the number of strings examined from R in each iteration.
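A sketch of this partitioned join with norm filtering follows. It assumes a cosine-style normalized similarity with global token weights, for which sim(s, r) ≥ θ implies θ ≤ ‖s‖/‖r‖ ≤ 1/θ; the helper names, the `l2_norm` callable, and the partition size are illustrative assumptions.

```python
import bisect

def norm_filtered_join(S, R, l2_norm, similarity, theta, partition_size=10_000):
    S = sorted(S, key=l2_norm)              # sort both datasets by L2-norm
    R = sorted(R, key=l2_norm)
    r_norms = [l2_norm(r) for r in R]
    results = []
    for start in range(0, len(S), partition_size):
        part = S[start:start + partition_size]
        low, high = l2_norm(part[0]), l2_norm(part[-1])
        # Norm filter: only strings of R with norm in [theta*low, high/theta]
        # can possibly exceed the threshold against this partition of S.
        lo = bisect.bisect_left(r_norms, theta * low)
        hi = bisect.bisect_right(r_norms, high / theta)
        for r in R[lo:hi]:
            for s in part:
                if similarity(s, r) > theta:
                    results.append((s, r))
    return results
```

Because both datasets are sorted by norm, the qualifying range of R can be located with binary search, so each iteration examines only a contiguous slice of R rather than the whole dataset.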