examined. This optimization is irrelevant when indexing frequency-sets
or sets. In addition, one could allow token positions to vary within some
bounds of the corresponding token position in the query, in order to
evaluate sequence based similarity measures that allow tokens to have
slightly different positions in the query and the resulting data strings
(e.g., recall the example 'The Bill & Melinda Gates Foundations' and
'The Melinda & Bill Gates Foundation'). An important detail that is omitted from Algorithm 6.1.1 for simplicity is that, whenever a given token λ appears multiple times in different positions, the algorithm should scan list L(λ) only once, simultaneously for all positions. We omit this detail in all subsequent algorithms as well.
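The duplicate-token handling described above can be sketched as follows. Since Algorithm 6.1.1 is not reproduced here, names such as `lists` (token → sorted list of string identifiers, i.e., set semantics) and `W` (token weights) are illustrative; the point is that a token occurring q times in the query is charged q times per matching string while its list is scanned only once:

```python
import heapq
from collections import Counter

def weighted_intersection(query_tokens, lists, W, theta):
    """Multiway merge over the inverted lists of the query's distinct
    tokens (set semantics: lists hold sorted string identifiers).
    A token occurring q times in the query contributes q * W[tok] per
    matching string, so each list L(tok) is scanned only once."""
    freq = Counter(query_tokens)                 # query-side multiplicities
    iters = {tok: iter(lists[tok]) for tok in freq if tok in lists}
    heap = []                                    # entries: (string_id, token)
    for tok, it in iters.items():
        sid = next(it, None)
        if sid is not None:
            heapq.heappush(heap, (sid, tok))
    results = []
    while heap:
        sid = heap[0][0]
        score = 0.0
        while heap and heap[0][0] == sid:        # drain all lists holding sid
            _, tok = heapq.heappop(heap)
            score += freq[tok] * W[tok]          # all duplicate positions at once
            nxt = next(iters[tok], None)
            if nxt is not None:
                heapq.heappush(heap, (nxt, tok))
        if score >= theta:
            results.append((sid, score))
    return results
```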
Computing Normalized Weighted Intersection is also straightforward,
provided that the L1-norm of every data string is stored in the token lists (see Definition 2.6). Hence, the lists store tuples string-identifier/token-position/L1-norm per string for sequences, string-identifier/token-frequency/L1-norm for frequency-sets, and string-identifier/L1-norm for sets. Once again, if lists are already sorted in increasing order of string identifiers, a multiway merge algorithm can evaluate the normalized weighted intersection between the query and all relevant strings very efficiently. The algorithm for computing normalized weighted intersection
essentially boils down to computing Weighted Intersection and hence is
very similar to Algorithm 6.1.1. Similar arguments hold for Jaccard, Dice
and Cosine similarity.
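A sketch of the merge extended with the stored L1-norms might look as follows. Normalizing by the maximum of the two norms is an assumption made here for concreteness (the exact form is fixed by Definition 2.6, which is not reproduced); only that single line would change for a different normalization:

```python
import heapq
from collections import Counter

def normalized_intersection(query_tokens, lists, W, theta):
    """lists[tok] holds (string_id, string_l1_norm) pairs sorted by
    string id, mirroring the tuples stored per token list for sets.
    Normalization by max of the two L1-norms is an assumed choice."""
    freq = Counter(query_tokens)
    q_norm = sum(f * W[t] for t, f in freq.items())   # ||q||_1
    iters = {t: iter(lists[t]) for t in freq if t in lists}
    heap = []                                         # (sid, token, s_norm)
    for t, it in iters.items():
        entry = next(it, None)
        if entry is not None:
            heapq.heappush(heap, (entry[0], t, entry[1]))
    out = []
    while heap:
        sid = heap[0][0]
        inter, s_norm = 0.0, None
        while heap and heap[0][0] == sid:
            _, t, s_norm = heapq.heappop(heap)
            inter += freq[t] * W[t]
            nxt = next(iters[t], None)
            if nxt is not None:
                heapq.heappush(heap, (nxt[0], t, nxt[1]))
        score = inter / max(q_norm, s_norm)           # assumed normalization
        if score >= theta:
            out.append((sid, score))
    return out
```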
A number of optimizations are possible in the case of unit weights (i.e., when ∀λ ∈ Λ, W(λ) = 1). For unit weights, the problem of identifying
strings with similarity exceeding the query threshold θ is equivalent to
the problem of identifying strings that appear in at least θ lists (since
all the lists have the same weight, it does not matter which list a string
appears in). In this case, in every iteration of the multiway merge algorithm, first we pop all occurrences of the top element of the heap. If that element occurs at least θ times we report it. Otherwise, assume that it appears a total of x times. We pop from the heap another θ − x − 1 elements, identify the new top element of the heap, let it be r, and directly skip to the first element s ≥ r in every token list (we can identify such elements using binary search on the respective lists); since only θ − 1 elements were popped in total, any skipped element can appear in at most θ − 1 lists and cannot qualify. Finally, we insert the new top elements from every list for which a skip actually occurred.
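The unit-weight skipping strategy above can be sketched as follows; this is a simplified variant in the spirit of the description (names such as `unit_weight_merge_skip` are illustrative), assuming `lists` holds sorted, duplicate-free string identifiers and `theta` is the integer occurrence threshold:

```python
import heapq
from bisect import bisect_left

def unit_weight_merge_skip(lists, theta):
    """With unit weights a string qualifies iff it appears in at least
    theta of the sorted token lists. Heap entries are (current value,
    list index); pos[i] is the offset of list i's current entry."""
    pos = [0] * len(lists)
    heap = [(lst[0], i) for i, lst in enumerate(lists) if lst]
    heapq.heapify(heap)
    results = []
    while heap:
        top = heap[0][0]
        popped = []
        while heap and heap[0][0] == top:        # pop all occurrences of top
            popped.append(heapq.heappop(heap))
        if len(popped) >= theta:
            results.append(top)
            r = top + 1                          # just advance past top
        else:
            # pop extra elements until theta - 1 in total: any value below
            # the new top then occurs in at most theta - 1 lists, so the
            # skipped region cannot contain a qualifying string
            while heap and len(popped) < theta - 1:
                popped.append(heapq.heappop(heap))
            r = heap[0][0] if heap else top + 1
        for _, i in popped:                      # binary-search skip to >= r
            pos[i] = bisect_left(lists[i], r, pos[i])
            if pos[i] < len(lists[i]):
                heapq.heappush(heap, (lists[i][pos[i]], i))
    return results
```

Note that a popped element equal to r is re-inserted at its current position, so ties with the new top are not lost during the skip.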