document, such as ads or the headlines of other articles to which the newspaper added a
link, that are not part of the news article. It turns out that there is a noticeable difference
between text that appears in prose and text that appears in ads or headlines. Prose has a
much greater frequency of stop words, the very frequent words such as “the” or “and.” The
total number of words that are considered stop words varies with the application, but it is
common to use a list of several hundred of the most frequent words.
EXAMPLE 3.23 A typical ad might say simply “Buy Sudzo.” On the other hand, a prose
version of the same thought written for an article is “I recommend that you buy Sudzo for
your laundry.” In the latter sentence, it would be normal to treat “I,” “that,” “you,” “for,”
and “your” as stop words.
Suppose we define a shingle to be a stop word followed by the next two words. Then
the ad “Buy Sudzo” from Example 3.23 has no shingles and would not be reflected in the
representation of the Web page containing that ad. On the other hand, the sentence from
Example 3.23 would be represented by five shingles: “I recommend that,” “that you buy,”
“you buy Sudzo,” “for your laundry,” and “your laundry x,” where x is whatever word
follows that sentence.
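The shingle definition above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the name stop_word_shingles, the regular-expression tokenizer, and the tiny STOP_WORDS set are hypothetical choices made here, and a real application would use a list of several hundred frequent words.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch); it covers only the words named in the surrounding text.
STOP_WORDS = {"i", "that", "you", "for", "your", "the", "and"}

def stop_word_shingles(text, stop_words=STOP_WORDS):
    """Return the set of shingles formed by a stop word plus the next two words."""
    # Crude tokenizer (an assumption): runs of letters and apostrophes.
    # Punctuation is dropped, so a shingle may cross a sentence boundary.
    words = re.findall(r"[A-Za-z']+", text)
    shingles = set()
    for i, word in enumerate(words):
        # A shingle needs a stop word followed by two more words.
        if word.lower() in stop_words and i + 2 < len(words):
            shingles.add(" ".join(words[i:i + 3]))
    return shingles
```

Under these assumptions, stop_word_shingles("Buy Sudzo") returns the empty set, while the prose sentence of Example 3.23 followed by one further word yields the five shingles listed above, the last one ending in that further word.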
Suppose we have two Web pages, each of which consists of half news text and half ads or
other material that has a low density of stop words. If the news text is the same but the
surrounding material is different, then we would expect that a large fraction of the shingles
of the two pages would be the same. They might have a Jaccard similarity of 75%. However,
if the surrounding material is the same but the news content is different, then the number
of common shingles would be small, giving a Jaccard similarity of perhaps 25%. If we were
to use the conventional shingling, where shingles are (say) sequences of 10 consecutive
characters, we would expect the two documents to share half their shingles (i.e., a Jaccard
similarity of 1/3), regardless of whether it was the news or the surrounding material that
they shared.
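The 1/3 figure for conventional shingling can be checked with a short sketch. The function names jaccard and half_shared_pages, and the integer stand-ins for shingles, are assumptions made for illustration. The model is simply that each page has 2n distinct shingles, n of which are shared with the other page, so the intersection has n elements and the union has 3n.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    # Convention assumed here: two empty sets are treated as identical.
    return len(a & b) / len(union) if union else 1.0

def half_shared_pages(n):
    """Build two hypothetical pages of 2n shingles each, sharing exactly n."""
    shared = set(range(n))                 # shingles common to both pages
    only_first = set(range(n, 2 * n))      # shingles unique to the first page
    only_second = set(range(2 * n, 3 * n)) # shingles unique to the second page
    return shared | only_first, shared | only_second
```

For any positive n, jaccard(*half_shared_pages(n)) evaluates to n / 3n = 1/3, whichever half of the page is the shared one. That is why conventional shingles cannot tell shared news apart from shared ads in this scenario.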
3.8.7 Exercises for Section 3.8
EXERCISE 3.8.1 Suppose we are trying to perform entity resolution among bibliographic
references, and we score pairs of references based on the similarities of their titles, list of
authors, and place of publication. Suppose also that all references include a year of
publication, and this year is equally likely to be any of the ten most recent years. Further,
suppose that we discover that among the pairs of references with a perfect score, there is an
average difference in the publication year of 0.1. Suppose that the pairs of references with a
certain score s are found to have an average difference in their publication dates of 2. What
is the fraction of pairs with score s that truly represent the same publication? Note: Do not make