Database Reference
In-Depth Information
false-negative rate. For instance, we could take 2048 functions from F in two groups of
1024. Construct the buckets for each of the functions. However, given a fingerprint P on
the gun:
(1) Find the buckets from the first group in which P belongs, and take the union of these
buckets.
(2) Do the same for the second group.
(3) Take the intersection of the two unions.
(4) Compare P only with those fingerprints in the intersection.
Note that we still have to take unions and intersections of large sets of fingerprints, but we
compare only a small fraction of those. It is the comparison of fingerprints that takes the
bulk of the time; in steps (1) and (2) fingerprints can be represented by their integer indices
in the database.
If we use this scheme, the probability of detecting a matching fingerprint is (0 . 985) 2 =
0 . 970; that is, we get about 3% false negatives. However, the probability of a false positive
is (0 . 063) 2 = 0 . 00397. That is, we only have to examine about 1/250th of the database.
3.8.6
Similar News Articles
Our last case study concerns the problem of organizing a large repository of on-line news
articles by grouping together Web pages that were derived from the same basic text. It is
common for organizations like The Associated Press to produce a news item and distribute
it to many newspapers. Each newspaper puts the story in its on-line edition, but surrounds it
by information that is special to that newspaper, such as the name and address of the news-
paper, links to related articles, and links to ads. In addition, it is common for the newspaper
to modify the article, perhaps by leaving off the last few paragraphs or even deleting text
from the middle. As a result, the same news article can appear quite different at the Web
sites of different newspapers.
The problem looks very much like the one that was suggested in Section 3.4 : find doc-
uments whose shingles have a high Jaccard similarity. Note that this problem is different
from the problem of finding news articles that tell about the same events. The latter prob-
lem requires other techniques, typically examining the set of important words in the doc-
uments (a concept we discussed briefly in Section 1.3.1 ) and clustering them to group to-
gether different articles about the same topic.
However, an interesting variation on the theme of shingling was found to be more ef-
fective for data of the type described. The problem is that shingling as we described it in
Section 3.2 treats all parts of a document equally. However, we wish to ignore parts of the
Search WWH ::




Custom Search