Database Reference
In-Depth Information
index on string identifiers is available on each of the lists in the sux,
the algorithm needs to perform, on average, one random access per
candidate per remaining list. If an index is not available, then assum-
ing that lists are sorted in increasing norm order, the algorithm keeps
track of the smallest and largest norm in the candidate set, seeks to the
first list element with norm larger equal to the smallest norm and scans
each list as deep as the largest norm. Alternatively, binary search can
be performed. This strategy will help potentially prune a long head
and tail from each list in the sux. Alternatively, if the size of the
remaining candidate set is small after the heaviest first strategy on the
shortest lists terminates, the algorithm can simply retrieve the strings
and compute the actual similarity.
Notice that in case of uniform token weights, it is meaningless to
order lists according to weights. In this special case lists are simply
sorted in increasing order of their lengths and all the algorithms dis-
cussed so far will work without any modifications, with similar ben-
efits. This results in a shortest-first prioritization of lists, which is of
independent interest, as was already mentioned above. In addition, all
L p -norm filters are still applicable: L 1 -norm reduces to the L 0 -norm
and L 2 -norm reduces to the square root of the L 0 -norm.
Surprisingly, the heaviest first strategy can also help prune
candidates for Weighted Intersection similarity, based on the obser-
vation that strings containing the heaviest query tokens are more likely
to exceed the threshold. Let query string v = λ 1 ···
λ v m and threshold
0 <θ≤ i =1 W ( λ i ), and assume without loss of generality that lists
are sorted in decreasing order of token weights (i.e., W ( λ 1 )
W ( λ v m )). Assume that there exists a string s that is contained only
in a su x L ( λ k ) ,...,L ( λ v m ) of token lists, whose aggregate weight
...
i = k W ( λ i ) . Clearly, such a string cannot exceed the query thresh-
old. An immediate conclusion is that
Lemma 6.16. Let query v = λ 1 ···λ v m s.t. W ( λ 1 )
≥ ··· ≥ W ( λ v m ), and
i =1 W ( λ i ). Let
threshold 0
|v|
W ( λ i ) ≥ θ.
π = arg max
1
π
m
i = π
Search WWH ::




Custom Search