Algorithms for Set Based Similarity Using Filtering Techniques - Approximate String Processing

Database Reference

In-Depth Information

7.1.4

All-Match Self-join Queries

As with join queries, the algorithms for self-join queries are similar to

the ones discussed in Section 6.4, where the full inverted index con-

structed during the incremental indexing phase is simply replaced by

a prefix signature inverted index. Similarly, the prefix reduction prin-

ciple discussed in Section 7.1.3 can be employed for self-join queries by

sorting the input dataset in L p -norm order and incrementally indexing

only the reduced prefixes of strings (see Algorithm 6.4.1).

7.1.5

Top-

k

Join and Self-join Queries

Once again top- k join and self-join queries can be answered by using

the all-match join and self-join algorithms with prefix inverted indexes

instead of full inverted indexes. The algorithms simply identify any k

candidates during initialization, and use the k -th smallest similarity as

the initial threshold θ . Notice that in the worst case the initial k -th

similarity can be 0, and the prefix inverted index will degenerate to a

full inverted index (each prefix is the complete string). In fact, there is

no better solution known when the input datasets do not fit in main

memory.

Nevertheless, there exists another effective strategy when there is

enough available memory to fit one dataset and its prefix inverted

index in main memory. The strategy uses an optimistic approach that

assumes that the k -th smallest similarity is the maximum possible sim-

ilarity (e.g., one for Normalized Weighted Intersection, Jaccard, Dice,

and Cosine similarity), and starts indexing prefixes incrementally, one

token at a time. This strategy is effective only in main memory, since

the algorithm has to repeatedly access each string when it is time to

index its next prefix token. The top- k self-join algorithm for Jaccard

similarity is shown as Algorithm 7.1.1. The same algorithm, with slight

modifications, can be used to answer top- k join queries between two sets

S,R as well (e.g., a simple way to do this is to run a slightly modified

self-join algorithm on the union S

R ). It can work for Weighted Inter-

section, Normalized Weighted Intersection, Dice and Cosine similarity.

In the beginning the algorithm indexes only the first token of each

string. When a new token s π from a given string s is indexed, the

∪

Search WWH ::

Custom Search

Home