Databases Reference
In-Depth Information
Probability
of becoming
a candidate
0
1
Jaccard similarity
of documents
Figure 3.7: The S-curve
1−(1−s r ) b
s
.2
.006
.3
.047
.4
.186
.5
.470
.6
.802
.7
.975
.8
.9996
Figure 3.8: Values of the S-curve for b = 20 and r = 5
3.4.3
Combining the Techniques
We can now give an approach to finding the set of candidate pairs for similar
documents and then discovering the truly similar documents among them. It
must be emphasized that this approach can produce false negatives - pairs of
similar documents that are not identified as such because they never become
a candidate pair. There will also be false positives - candidate pairs that are
evaluated, but are found not to be su ciently similar.
1. Pick a value of k and construct from each document the set of k-shingles.
Optionally, hash the k-shingles to shorter bucket numbers.
2. Sort the document-shingle pairs to order them by shingle.
3. Pick a length n for the minhash signatures. Feed the sorted list to the
algorithm of Section 3.3.5 to compute the minhash signatures for all the
Search WWH ::




Custom Search