Observation 6.3. The answers that are potentially more similar to the query have L₁-norm equal to ∥v∥₁. Strings with ∥s∥₁ = ∥v∥₁ potentially yield the maximum similarity, which is equal to one. (Notice that we refer to potential similarity, given that ∥s∥₁ = ∥v∥₁ does not necessarily imply that s = v, since s and v might differ in a number of tokens that happen to have equal weights, resulting in equal L₁-norms.) Intuitively, the potential similarity of candidate strings decreases as the L₁-norm of strings diverges from that of the query in either direction.
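To make the caveat concrete, here is a minimal sketch (the tokens and weights are illustrative, not from the text) of two distinct strings whose weighted token sets nevertheless have equal L₁-norms:

    # Hypothetical idf-style token weights; any weighting scheme with ties works.
    weights = {"data": 0.4, "base": 0.4, "index": 0.7, "query": 0.7}

    def l1_norm(tokens):
        # L1-norm of a string's weighted token set: the sum of its token weights.
        return sum(weights[t] for t in tokens)

    s = {"data", "index"}   # ||s||_1 = 0.4 + 0.7 = 1.1
    v = {"base", "query"}   # ||v||_1 = 0.4 + 0.7 = 1.1

    assert l1_norm(s) == l1_norm(v)   # equal norms ...
    assert s != v                     # ... yet s and v are different strings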
Based on Observation 6.3, the most promising candidates are the strings with L₁-norms clustered around the L₁-norm of the query. We can exploit this intuition in order to find a better k-th similarity approximation faster. In essence, instead of starting the round robin iterations of the NRA algorithm from the beginning of the lists, we start from the first string s within each list such that ∥s∥₁ = ∥v∥₁. Then, we perform round robin iterations in both directions, once towards the end of the lists and once towards the beginning. When the algorithm has identified k strings, the k-th smallest partial similarity becomes the query threshold θ, and the algorithm can revert to normal all-match selection processing by scanning lists within the L₁-norm intervals given by Lemma 6.1. Notice that the algorithm is a heuristic: there is no guarantee regarding the quality of the k-th partial similarity discovered using this approach. In the best case the algorithm might immediately identify k strings equal to the query and stop. In the worst case the algorithm might be worse than vanilla NRA (e.g., if it happens that the k-th most similar string is either the first or the last element in all lists).
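The seeding step can be sketched as follows. This is a schematic sketch, not the algorithm's actual implementation: each inverted list is assumed to be a sequence of (l1_norm, string_id) pairs sorted by L₁-norm, and partial_sim is a stand-in for NRA's partial-score bookkeeping over the entries seen so far.

    import bisect

    def initial_threshold(lists, query_norm, k, partial_sim):
        # Seed a cursor pair per list at the first entry with norm >= ||v||_1,
        # then expand outward in both directions, round robin across lists.
        cursors = []
        for lst in lists:
            pos = bisect.bisect_left([norm for norm, _ in lst], query_norm)
            cursors.append([pos, pos - 1])   # [towards end, towards beginning]

        seen = set()
        while len(seen) < k:
            advanced = False
            for lst, cur in zip(lists, cursors):
                if cur[0] < len(lst):        # one step towards the end
                    seen.add(lst[cur[0]][1])
                    cur[0] += 1
                    advanced = True
                if cur[1] >= 0:              # one step towards the beginning
                    seen.add(lst[cur[1]][1])
                    cur[1] -= 1
                    advanced = True
            if not advanced:                 # all lists exhausted early
                break

        # The k-th partial similarity among the seen strings becomes theta.
        scores = sorted((partial_sim(s) for s in seen), reverse=True)
        return scores[k - 1] if len(scores) >= k else 0.0

The bidirectional expansion reflects the observation directly: entries whose L₁-norms lie closest to ∥v∥₁, on either side, are the most promising and are examined first.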
Another heuristic is to use an optimistic approach that assumes that there are at least k data strings with similarity equal to one. Then, we can run an all-match selection query using θ = 1. If the result contains fewer than k answers, we reduce θ deterministically and restart the previous query from where it stopped. We continue until k answers have been produced. Once again, depending on the data and the particular query, this algorithm can either produce all answers within one iteration or require several iterations with progressively smaller thresholds.
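A minimal sketch of this optimistic loop, assuming a hypothetical run_all_match(theta) that resumes the list scans from where the previous invocation stopped and returns the answers accumulated so far, sorted by similarity. The fixed reduction step below is one arbitrary choice; the text only requires that θ be lowered deterministically.

    def optimistic_top_k(run_all_match, k, step=0.1):
        theta = 1.0                          # optimistic start: exact matches only
        answers = run_all_match(theta)
        while len(answers) < k and theta > 0.0:
            theta = max(theta - step, 0.0)   # deterministic threshold reduction
            answers = run_all_match(theta)   # resume; do not recompute from scratch
        return answers[:k]

The trade-off mirrors the previous heuristic: a query with many near-exact matches finishes in the first iteration, while a query whose k-th answer has low similarity pays for several restarts.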