Database Reference
In-Depth Information
When using the list merging algorithms proposed in Section 6 for
set based similarity, the resulting set of strings forms the exact answer
to the query. In contrast, for edit based similarity, the resulting set of
strings is a super set of the exact answer, since the various filtering
techniques are used to prune candidate strings, but not to compute
the actual edit distance between the query and each string (that would
only be possible if the algorithm maintained all the discovered q -gram
information per string, which would make the topic-keeping cost pro-
hibitive). This implies that a verification step is necessary in order
to filter out false positives. The verification step is performed by ver-
ifying that the edit distance between each candidate string and the
query is indeed smaller than or equal to the query threshold. For that
purpose we can use the fast edit distance verification algorithm pre-
sented in Section 2.1. This is in contrast to the same algorithms used
for set based similarity measures, where the actual similarity is com-
puted during index traversal, and no subsequent verification step is
needed.
One pitfall of this approach is that for a large enough threshold
θ , short query lengths and an inappropriate q -gram length, the q -
gram intersection constraint might return a non-positive value. In other
words, potential answers might not have any q -grams in common with
the query. Given that the list merging algorithms examine only those
inverted lists that correspond to query q -grams, answers that do not
share any q -grams with the query will never be discovered. In this case,
only a linear scan of the inverted index would be able to identify all
answers. For example, assume that θ =1 ,q = 2, and v = abc . Then,
according to Lemma 8.1, potential answers to the query need to have
at least 3 2+1 2=0 q -grams in common. String s = adc is a valid
answer for this query, yet it does not have any q -grams in common with
v . Similar counter-examples exist for the case of extended q -gram sets.
Finally, it should be noted here that, as a consequence of using
q -grams to evaluate the edit distance, these techniques cannot be
extended for evaluating either weighted (i.e., where each edit opera-
tion has a specific weight) or normalized edit distance (i.e., where the
distance is normalized by the length of the strings).
Search WWH ::




Custom Search