Database Reference
In-Depth Information
initialized with one event s, 1 , 1 . 0 per string s ∈ S , and maintained in
decreasing order of
s .
The similarity upper bound for a given string s is computed by
considering the best case scenario. Abusing notation, let P
J
π ( s ))
denote the prefix (sux) of string s containing exactly the first π
tokens (all remaining tokens). By construction, the tokens in the cur-
rent indexed prefix P
π ( s )( S
π ( r ) of any string r that is not already contained
in the answer set, do not match any of the tokens in the already indexed
prefix
π ( s )of s . Then, the best-case similarity upper bound between
s and r is when all of the tokens in the current (unseen) sux S
P
π ( r )of
π ( s )of s . The potential Jaccard
r match those in the unknown sux
S
similarity becomes
π ( s )
π ( s )
S
1
1 S
1
J
( s, r )
.
π ( r )
P
π ( s )
1 +
P
1 +
S
π ( s )
s
1
Here, we need to compute the upper bound similarity of s with respect
to an arbitrary string r , thus based only on information in s ; drop-
ping the term
π ( r )
1 from the denominator and maximizing the
numerator yields the desired upper bound. Hence
P
π ( s )
π ( s )
s = S
=
s
−P
1
1
1
π
J
.
s 1
s 1
An interesting observation here is that if we assume unit token weights
we can improve the similarity upper bound based on the fact that
during each iteration of the algorithm the indexed prefixes consist of
exactly the same number of tokens for all strings. In that case we know
that π = π (i.e.,
π ( r )
π ( s )
P
1 =
P
1 ), hence the similarity upper
bound becomes
π ( s )
S
1
s =
π ( s ) 1 .
The algorithm proceeds by extracting the top event from the event
heap, say
J
s 1 + P
, and then scans inverted list L ( λ π ) to identify new
candidates. The exact similarity of each candidate is computed and
the top- k result heap is updated appropriately. Then, the algorithm
needs to index the new token λ π and compute the similarity upper
bound
s
s, π,
J
π +1
s
J
of s with any unseen string that does not agree with s
Search WWH ::




Custom Search