Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

false-negative rate. For instance, we could take 2048 functions from F in two groups of

1024. Construct the buckets for each of the functions. However, given a fingerprint P on

the gun:

(1) Find the buckets from the first group in which P belongs, and take the union of these

buckets.

(2) Do the same for the second group.

(3) Take the intersection of the two unions.

(4) Compare P only with those fingerprints in the intersection.

Note that we still have to take unions and intersections of large sets of fingerprints, but we

compare only a small fraction of those. It is the comparison of fingerprints that takes the

bulk of the time; in steps (1) and (2) fingerprints can be represented by their integer indices

in the database.

If we use this scheme, the probability of detecting a matching fingerprint is (0 . 985) 2 =

0 . 970; that is, we get about 3% false negatives. However, the probability of a false positive

is (0 . 063) 2 = 0 . 00397. That is, we only have to examine about 1/250th of the database.

3.8.6

Similar News Articles

Our last case study concerns the problem of organizing a large repository of on-line news

articles by grouping together Web pages that were derived from the same basic text. It is

common for organizations like The Associated Press to produce a news item and distribute

it to many newspapers. Each newspaper puts the story in its on-line edition, but surrounds it

by information that is special to that newspaper, such as the name and address of the news-

paper, links to related articles, and links to ads. In addition, it is common for the newspaper

to modify the article, perhaps by leaving off the last few paragraphs or even deleting text

from the middle. As a result, the same news article can appear quite different at the Web

sites of different newspapers.

The problem looks very much like the one that was suggested in Section 3.4 : find doc-

uments whose shingles have a high Jaccard similarity. Note that this problem is different

from the problem of finding news articles that tell about the same events. The latter prob-

lem requires other techniques, typically examining the set of important words in the doc-

uments (a concept we discussed briefly in Section 1.3.1 ) and clustering them to group to-

gether different articles about the same topic.

However, an interesting variation on the theme of shingling was found to be more ef-

fective for data of the type described. The problem is that shingling as we described it in

Section 3.2 treats all parts of a document equally. However, we wish to ignore parts of the

Search WWH ::

Custom Search

Home