Database Reference
In-Depth Information
consequence that we are building the minimal inverted index needed for
identifying all answers with respect to any string in the second dataset.
The actual algorithms are the same as those discussed in Section 6.3
and hence a detailed discussion here is omitted.
When using the sorted block nested loop join algorithm with prefix
filters, there is an additional optimization that we can apply, which
reduces the size of the prefixes indexed by taking advantage of the fact
that the input datasets are sorted in increasing L p -norm order.
Definition 7.4(Jaccard Reduced Prefix). Let s = 1 ,...,λ s m } be
a string with its tokens sorted in increasing
order. Let θ be a Jaccard
similarity threshold. The reduced prefix of s is defined as
P J 2 θ
θ +1
( s ).
2 θ
For simplicity let φ =
θ +1 . The following is true:
Lemma 7.5. Given two strings s, r s.t.
s
r
1 , the reduced prefix
1
P φ ( s )of s and the prefix
P θ
( r )of r , it holds that
⇒P φ ( s )
∩P θ ( r )
J
( s, r )
θ
=
.
P φ ( s )
∩P θ ( r )=
Proof. Let
. It holds that
P φ ( s )
∩P θ ( r )
P φ ( s )
∩S θ ( r )
s
r
1 =
1 +
1
S φ ( s )
∩P θ ( r )
S φ ( s )
∩S θ ( r )
+
1 +
1
S φ ( s )
S θ ( r )
max(
1 ,
1 ) .
By definition
s
r
s
r
1
1
J
( s, r )=
1
.
s
1 +
r
s
r
2
s
s
r
1
1
1
Consider the two cases:
1 ≤S φ ( s )
2 θ
(1) Let
s
r
1 <
θ +1
s
1 . Then,
2 θ
θ +1
s
1
J ( s, r ) <
= θ.
2 θ
2
s
1
θ +1
s
1
 
Search WWH ::




Custom Search