Database Reference
In-Depth Information
In the special case where token weights are uniform (i.e., ∀λ ∈ Λ,
W ( λ )= c for some constant c ), notice that defining the order function
to be a function of W is meaningless. Nevertheless, we can still effec-
tively utilize prefix signatures, since s 1 = s 2 = cs 0 (where s 0
is the cardinality of the sequence, frequency-set, set). The reasoning of
prefix filtering remains exactly the same as before by considering pre-
fix and sux lengths as opposed to weights . Once again, in order to
improve the pruning effectiveness of the filter the order function
is
defined to be the decreasing token idf weight order (irrespective of the
uniform weighing function W that is used to compute string similarity).
7.1.1
All-Match Selection Queries
To answer selection queries we can build an inverted index on the prefix
signatures. Given a dataset of strings S , we build an inverted index on
the signatures of all s
S . The inverted index consists of one list per
token present in the union of prefix signatures. Given a query string
v = 1 ,...,λ v m }
and query threshold θ ≥ θ , first we build the prefix
signature
θ ( v ). The query candidate answer set consists of strings
contained in the union of token lists L ( λ i ) s.t. λ i ∈P θ ( v ), that satisfy
Lemma 7.4. After the candidate set has been produced, a refinement
step computes the exact similarity between the query and all candi-
dates.
Note here that when strings are represented as sequences, if the sim-
ilarity of strings is intended to be interpreted as the similarity of their
sequences, a token/position pair is a match if and only if it agrees both
on the token and the position (recall that how we interpret the repre-
sentation of strings affects the semantics of similarity). On the other
hand, when representing strings as frequency-sets (which are essentially
bags of tokens), a token/frequency pair is a match based on the token
alone (the frequency is not important, since one occurrence of the token
is enough to yield a non-empty intersection of prefixes).
A drawback of the prefix inverted index is that a minimum query
threshold θ needs to be determined in advance. Smaller θ by con-
struction implies longer prefixes on average, adversely affecting the
size of the inverted index. Answering a query v with threshold θ
P
θ
Search WWH ::




Custom Search