Algorithms for Set Based Similarity Using Inverted Indexes - Approximate String Processing

Database Reference

In-Depth Information

and 5.2). If the strings are treated as sequences, then each list element

is a string-identifier/token-position pair. If the strings are treated as

frequency-sets each list element is a string-identifier/token-frequency

pair. Finally, if the strings are treated as sets, each list element is simply

a string identifier. In what follows we consider algorithms for sequences

only, since sequences are the most general representation of strings. The

concepts and algorithms can be straightforwardly modified to work on

frequency-set and sets.

Depending on the algorithm used to evaluate the chosen similarity

function, the lists can be stored sequentially as a flat file, using a B-tree

or a hash table. They can also be sorted according to string identifiers

or any other information stored in the lists (e.g., the L p -norm of the

strings as will become clear shortly).

All-match selection queries can be answered quite eciently using

an inverted index. To answer the query, first the query string is tok-

enized using exactly the same tokenization scheme that was used for

the data strings. The data strings satisfying the similarity threshold

need to have at least one token in common with the query, hence only

the token lists corresponding to query tokens need to be involved in the

search. This results in significant pruning compared to the brute-force

algorithm.

6.1.1

Lists Sorted in String Identifier Order

Let S be a set of strings over which we build an inverted index. Let

v = λ 1 ···

λ v m ,λ i

L ( λ 1 ) ,...,L ( λ v m )

be

the m lists corresponding to the query tokens. By construction, each

list L ( λ i ) contains all string identifiers of strings s

∈

Λ be a query string and L v =

{

}

S s.t. λ i

s . The

simplest algorithm for evaluating the similarity between the query and

all the strings in lists L v , is to perform a multiway merge of the string

identifiers in L v , to compute all intersections v

∈

s . This will directly

yield the Weighted Intersection similarity. Whether we compute the

weighted intersection on sequences, frequency-sets or sets depends on

whether the initial construction of the inverted index was done on

sequences, frequency-sets, or sets. Clearly, the multiway merge com-

putation can be performed very eciently if the lists are already sorted

∩

Approximate String Processing

Search WWH ::

Custom Search

Home