3.2.4 Shingles Built from Words
An alternative form of shingle has proved effective for the problem of identifying similar
news articles, mentioned in Section 3.1.2. The exploitable distinction for this problem is
that news articles are written in a rather different style from the other elements that typically
appear on the page with the article. News articles, and most prose, contain many stop
words (see Section 1.3.1), the most common words such as “and,” “you,” “to,” and so on.
In many applications, we want to ignore stop words, since they don't tell us anything useful
about the article, such as its topic.
However, for the problem of finding similar news articles, it was found that defining a
shingle to be a stop word followed by the next two words, regardless of whether or not they
are stop words, yields a useful set of shingles. The advantage of this approach is that the
news article then contributes more shingles to the set representing the Web page than
do the surrounding elements. Recall that the goal of the exercise is to find pages that
have the same article, regardless of the surrounding elements. By biasing the set of shingles
in favor of the article, pages with the same article and different surrounding material have
higher Jaccard similarity than pages with the same surrounding material but a different
article.
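The bias toward article text can be illustrated with a minimal Python sketch. The stop-word list, the tokenizer, the helper names `stop_word_shingles` and `jaccard`, and the three toy pages are all assumptions made for illustration; the text does not prescribe any of them.

```python
import re

# Assumed stop-word list, for illustration only; the text fixes no particular list.
STOP_WORDS = {"a", "and", "for", "have", "is", "it", "that", "the", "to", "you"}

def stop_word_shingles(text, stop_words=STOP_WORDS):
    """Return the set of shingles formed by a stop word plus the next two words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        " ".join(words[i:i + 3])
        for i in range(len(words) - 2)
        if words[i] in stop_words
    }

def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b|, taken to be 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical page contents: an article plus surrounding navigation and ad text.
article = ("A spokesperson for the Sudzo Corporation revealed today that studies "
           "have shown it is good for people to buy Sudzo products.")
other_article = ("The city council voted to raise the parking fees that apply "
                 "to the downtown area.")

page1 = "Home News Sports " + article + " Buy Sudzo"
page2 = "Weather Travel Opinion " + article + " Subscribe now"
page3 = "Home News Sports " + other_article + " Buy Sudzo"

s1, s2, s3 = (stop_word_shingles(p) for p in (page1, page2, page3))

same_article = jaccard(s1, s2)      # same article, different surroundings
same_surroundings = jaccard(s1, s3) # same surroundings, different article
print(same_article, same_surroundings)
```

In this toy setup the navigation and ad text contain no stop words, so they contribute no shingles at all, and the similarity is determined entirely by the article text.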
EXAMPLE 3.5 An ad might have the simple text “Buy Sudzo.” However, a news article
with the same idea might read something like “A spokesperson for the Sudzo Corporation
revealed today that studies have shown it is good for people to buy Sudzo products.”
The likely stop words here are “A,” “for,” “the,” “that,” “have,” “it,” “is,” “for,” and “to,” although
there is no set number of the most frequent words that should be considered stop words.
The first three shingles made from a stop word and the next two following are:
A spokesperson for
for the Sudzo
the Sudzo Corporation
There are nine shingles from the sentence, but none from the “ad.”
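The counts in this example can be checked with a short Python sketch. The stop-word set below is an assumption chosen to cover the stop words that occur in the sentence, and `ordered_stop_word_shingles` is a hypothetical helper that returns shingles in order of appearance rather than as a set.

```python
import re

# Assumed stop-word set covering the stop words that occur in the example sentence.
STOP_WORDS = {"a", "for", "the", "that", "have", "it", "is", "to"}

def ordered_stop_word_shingles(text, stop_words=STOP_WORDS):
    """List, in order, each stop word together with the two words that follow it."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    return [
        " ".join(words[i:i + 3])
        for i in range(len(words) - 2)
        if words[i].lower() in stop_words
    ]

sentence = ("A spokesperson for the Sudzo Corporation revealed today that studies "
            "have shown it is good for people to buy Sudzo products.")
ad = "Buy Sudzo."

article_shingles = ordered_stop_word_shingles(sentence)
ad_shingles = ordered_stop_word_shingles(ad)

print(article_shingles[:3])
print(len(article_shingles), len(ad_shingles))
```

The ad yields nothing for two reasons: it contains no stop word, and with only two words it is too short to hold a three-word shingle.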
3.2.5 Exercises for Section 3.2
EXERCISE 3.2.1 What are the first ten 3-shingles in the first sentence of Section 3.2?