Algorithms for Edit Based Similarity - Approximate String Processing

Database Reference

In-Depth Information

When using the list merging algorithms proposed in Section 6 for

set based similarity, the resulting set of strings forms the exact answer

to the query. In contrast, for edit based similarity, the resulting set of

strings is a super set of the exact answer, since the various filtering

techniques are used to prune candidate strings, but not to compute

the actual edit distance between the query and each string (that would

only be possible if the algorithm maintained all the discovered q -gram

information per string, which would make the topic-keeping cost pro-

hibitive). This implies that a verification step is necessary in order

to filter out false positives. The verification step is performed by ver-

ifying that the edit distance between each candidate string and the

query is indeed smaller than or equal to the query threshold. For that

purpose we can use the fast edit distance verification algorithm pre-

sented in Section 2.1. This is in contrast to the same algorithms used

for set based similarity measures, where the actual similarity is com-

puted during index traversal, and no subsequent verification step is

needed.

One pitfall of this approach is that for a large enough threshold

θ , short query lengths and an inappropriate q -gram length, the q -

gram intersection constraint might return a non-positive value. In other

words, potential answers might not have any q -grams in common with

the query. Given that the list merging algorithms examine only those

inverted lists that correspond to query q -grams, answers that do not

share any q -grams with the query will never be discovered. In this case,

only a linear scan of the inverted index would be able to identify all

answers. For example, assume that θ =1 ,q = 2, and v = abc . Then,

according to Lemma 8.1, potential answers to the query need to have

at least 3 − 2+1 − 2=0 q -grams in common. String s = adc is a valid

answer for this query, yet it does not have any q -grams in common with

v . Similar counter-examples exist for the case of extended q -gram sets.

Finally, it should be noted here that, as a consequence of using

q -grams to evaluate the edit distance, these techniques cannot be

extended for evaluating either weighted (i.e., where each edit opera-

tion has a specific weight) or normalized edit distance (i.e., where the

distance is normalized by the length of the strings).

Search WWH ::

Custom Search

Home