Database Reference
In-Depth Information
the norm of the prefix). This implies that we would have to store two
norm values per entry, per inverted list, which increases the size of the
inverted index significantly.
The prefix signature of a string for the other similarity measures
follows.
Definition 7.3 (Prefix Signatures).
P θ ( s ) for
Normalized Weighted Intersection similarity is defined with
respect to:
Normalized Weighted Intersection: The prefix
l
W ( λ i )
π = arg max
1 ≤π≤l
θ
s
1 .
i = π
P θ ( s ) for Dice similarity is defined with
Dice: The prefix
respect to:
l
θ
2 − θ
W ( λ i )
π = arg max
1 ≤π≤l
s
1 .
i = π
P θ ( s ) for Cosine similarity is defined with
Cosine: The prefix
respect to:
l
W ( λ i ) 2
π = arg max
1 ≤π≤l
θ
s
2 .
i = π
An important concern regarding prefix signatures is that the token
ordering function significantly affects the pruning power of the prefix fil-
ter. Since the prefix filter evaluates the intersection between two prefix
signatures, the rarer the tokens contained in the prefixes are, the lower
the probability of a false positive intersection becomes. Rare tokens by
definition appear in a small number of prefixes, while frequent tokens
will tend to appear in the majority of prefixes. For that reason the
prefix filter is often used in combination with frequency based token
weights (e.g., idf weights).
 
Search WWH ::




Custom Search