Database Reference
In-Depth Information
For example, at s = 0 . 8, 1 − (0 . 8) 5 is about 0.672. If you raise this number to the 20th
power, you get about 0.00035. Subtracting this fraction from 1 yields 0.99965. That is,
if we consider two documents with 80% similarity, then in any one band, they have only
about a 33% chance of agreeing in all five rows and thus becoming a candidate pair.
However, there are 20 bands and thus 20 chances to become a candidate. Only roughly one
in 3000 pairs that are as high as 80% similar will fail to become a candidate pair and thus
be a false negative.
3.4.3
Combining the Techniques
We can now give an approach to finding the set of candidate pairs for similar documents
and then discovering the truly similar documents among them. It must be emphasized that
this approach can produce false negatives - pairs of similar documents that are not identi-
fied as such because they never become a candidate pair. There will also be false positives
- candidate pairs that are evaluated, but are found not to be sufficiently similar.
(1) Pick a value of k and construct from each document the set of k -shingles. Optionally,
hash the k -shingles to shorter bucket numbers.
(2) Sort the document-shingle pairs to order them by shingle.
(3) Pick a length n for the minhash signatures. Feed the sorted list to the algorithm of Sec-
tion 3.3.5 to compute the minhash signatures for all the documents.
(4) Choose a threshold t that defines how similar documents have to be in order for them
to be regarded as a desired “similar pair.” Pick a number of bands b and a number of
rows r such that br = n , and the threshold t is approximately (1/ b ) 1 /r . If avoidance of
false negatives is important, you may wish to select b and r to produce a threshold
lower than t ; if speed is important and you wish to limit false positives, select b and r
to produce a higher threshold.
(5) Construct candidate pairs by applying the LSH technique of Section 3.4.1 .
(6) Examine each candidate pair's signatures and determine whether the fraction of com-
ponents in which they agree is at least t .
(7) Optionally, if the signatures are sufficiently similar, go to the documents themselves
and check that they are truly similar, rather than documents that, by luck, had similar
signatures.
Search WWH ::




Custom Search