Database Reference
In-Depth Information
2. An unseen string s appears only in L ( λ 1 ) ,L ( λ 2 ). By con-
struction
s
f 2
⇒N
( s, v ) .
1
1
3. ...
m . An unseen string s appears only in L ( λ 1 ) ,...,L ( λ v m ). By con-
struction
s
1 >
f m
⇒N
( s, v ) .
1
We can also use Observation 6.1 to improve the pruning power of
the nra algorithm by computing a best-case similarity for all candidate
strings before and after they have been inserted in the candidate set.
Strings whose best-case similarity is below the threshold can be pruned
immediately, reducing the memory requirements and the topic-keeping
cost of the algorithm. The best-case similarity of a new candidate uses
Observation 6.1 to identify lists that do not contain that candidate.
Lemma 6.4. Given query v , candidate string s , and a subset of lists
L v
L v potentially containing s (based on Observation 6.1 and the
frontier elements f i ), the best-case similarity score for s is
v 1
b ( s, v )=
N
1 ) .
max(
s
1 ,
v
It is important to note here that the above reasoning is correct if
and only if each string entry appears only once in every token list.
This is indeed the case for the frequency set and set representations,
but not always true for sequences, when the positional information is
disregarded (e.g., for computing the similarity of strings without regard
for positions). In that case, in order to compute the termination con-
dition correctly, we need to know for an arbitrary frontier element the
maximum number of times that the element might be contained in a
given list. The maximum frequency of an arbitrary entry per inverted
list can be computed at list construction time. Alternatively, we can
assume that a given token appears in a data string at most as many
times as it appears in the query, as a worst case scenario, and as a
result overestimate the similarity of the frontier elements.
An alternative implementation of the nra algorithm can postpone
updating candidate bit vectors and purging candidates from M until
 
Search WWH ::




Custom Search