Finding Similar Items - Mining of Massive Datasets


document, such as ads or the headlines of other articles to which the newspaper added a

link, that are not part of the news article. It turns out that there is a noticeable difference

between text that appears in prose and text that appears in ads or headlines. Prose has a

much greater frequency of stop words, the very frequent words such as “the” or “and.” The

total number of words that are considered stop words varies with the application, but it is

common to use a list of several hundred of the most frequent words.

EXAMPLE 3.23 A typical ad might say simply “Buy Sudzo.” On the other hand, a prose

version of the same thought written for an article is “I recommend that you buy Sudzo for

your laundry.” In the latter sentence, it would be normal to treat “I,” “that,” “you,” “for,”

and “your” as stop words.

□

Suppose we define a shingle to be a stop word followed by the next two words. Then

the ad “Buy Sudzo” from Example 3.23 has no shingles and would not be reflected in the

representation of the Web page containing that ad. On the other hand, the sentence from

Example 3.23 would be represented by five shingles: “I recommend that,” “that you buy,”

“you buy Sudzo,” “for your laundry,” and “your laundry x,” where x is whatever word follows

that sentence.
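
The stop-word shingling just described can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a definitive implementation: the function name stop_word_shingles, the tokenizing regular expression, and the five-word stop list (taken from Example 3.23) are all hypothetical choices made for illustration. A real application would use a list of several hundred stop words.

```python
import re

# Assumed stop-word list: only the five words Example 3.23 treats as stop words.
# A realistic list would contain several hundred frequent words.
STOP_WORDS = {"i", "that", "you", "for", "your"}


def stop_word_shingles(text, stop_words=STOP_WORDS, following=2):
    """Return the set of shingles made of a stop word plus the next `following` words."""
    # Assumed tokenizer: runs of letters, digits, and apostrophes count as words.
    words = re.findall(r"[A-Za-z0-9']+", text)
    shingles = set()
    for i, word in enumerate(words):
        # Emit a shingle only if enough words follow the stop word.
        if word.lower() in stop_words and i + following < len(words):
            shingles.add(" ".join(words[i:i + following + 1]))
    return shingles


ad = "Buy Sudzo"
# "Then" stands in for x, the word that follows the sentence.
prose = "I recommend that you buy Sudzo for your laundry. Then"

ad_shingles = stop_word_shingles(ad)
prose_shingles = stop_word_shingles(prose)
```

Under these assumptions, the ad contributes no shingles at all, while the prose sentence contributes the five shingles listed above, with “Then” playing the role of x.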

Suppose we have two Web pages, each of which consists of half news text and half ads or

other material that has a low density of stop words. If the news text is the same but the

surrounding material is different, then we would expect that a large fraction of the shingles of

the two pages would be the same. They might have a Jaccard similarity of 75%. However,

if the surrounding material is the same but the news content is different, then the

fraction of shingles in common would be small, perhaps 25%. If we were to use the conventional

shingling, where shingles are (say) sequences of 10 consecutive characters, we would

expect the two documents to share half their shingles (i.e., a Jaccard similarity of 1/3),

regardless of whether it was the news or the surrounding material that they shared.
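
The step from “share half their shingles” to “a Jaccard similarity of 1/3” can be checked with a small computation. The sketch below is illustrative only: the function name jaccard, the synthetic shingle labels, and the count of 100 shingles per page are assumptions, and treating the similarity of two empty sets as 0 is a convention chosen here, not something the text specifies. If each page has n shingles and n/2 of them are shared, the intersection has n/2 elements and the union has n + n - n/2 = 3n/2, so the ratio is 1/3.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    # Convention assumed for this sketch: two empty sets have similarity 0.
    return len(a & b) / len(union) if union else 0.0


# Synthetic shingle sets: each page has 100 shingles, 50 of which are shared.
shared = {f"shared-{i}" for i in range(50)}
page_a = shared | {f"a-only-{i}" for i in range(50)}
page_b = shared | {f"b-only-{i}" for i in range(50)}

similarity = jaccard(page_a, page_b)  # 50 shared / 150 in the union
```

The result does not depend on which half of each page is shared, which is the point of the comparison with stop-word shingles.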

3.8.7 Exercises for Section 3.8

EXERCISE 3.8.1 Suppose we are trying to perform entity resolution among bibliographic

references, and we score pairs of references based on the similarities of their titles, list of

authors, and place of publication. Suppose also that all references include a year of

publication, and this year is equally likely to be any of the ten most recent years. Further, suppose

that we discover that among the pairs of references with a perfect score, there is an average

difference in the publication year of 0.1. Suppose that the pairs of references with a certain

score s are found to have an average difference in their publication dates of 2. What is the

fraction of pairs with score s that truly represent the same publication? Note: Do not make
