Database Reference
In-Depth Information
and 5.2). If the strings are treated as sequences, then each list element
is a string-identifier/token-position pair. If the strings are treated as
frequency-sets each list element is a string-identifier/token-frequency
pair. Finally, if the strings are treated as sets, each list element is simply
a string identifier. In what follows we consider algorithms for sequences
only, since sequences are the most general representation of strings. The
concepts and algorithms can be straightforwardly modified to work on
frequency-set and sets.
Depending on the algorithm used to evaluate the chosen similarity
function, the lists can be stored sequentially as a flat file, using a B-tree
or a hash table. They can also be sorted according to string identifiers
or any other information stored in the lists (e.g., the L p -norm of the
strings as will become clear shortly).
All-match selection queries can be answered quite eciently using
an inverted index. To answer the query, first the query string is tok-
enized using exactly the same tokenization scheme that was used for
the data strings. The data strings satisfying the similarity threshold
need to have at least one token in common with the query, hence only
the token lists corresponding to query tokens need to be involved in the
search. This results in significant pruning compared to the brute-force
algorithm.
6.1.1
Lists Sorted in String Identifier Order
Let S be a set of strings over which we build an inverted index. Let
v = λ 1 ···
λ v m i
L ( λ 1 ) ,...,L ( λ v m )
be
the m lists corresponding to the query tokens. By construction, each
list L ( λ i ) contains all string identifiers of strings s
Λ be a query string and L v =
{
}
S s.t. λ i
s . The
simplest algorithm for evaluating the similarity between the query and
all the strings in lists L v , is to perform a multiway merge of the string
identifiers in L v , to compute all intersections v
s . This will directly
yield the Weighted Intersection similarity. Whether we compute the
weighted intersection on sequences, frequency-sets or sets depends on
whether the initial construction of the inverted index was done on
sequences, frequency-sets, or sets. Clearly, the multiway merge com-
putation can be performed very eciently if the lists are already sorted
Search WWH ::




Custom Search