Databases Reference
In-Depth Information
set of integers that are bucket numbers of one or more k-shingles that appear
in the document. For instance, we could construct the set of 9-shingles for a
document and then map each of those 9-shingles to a bucket number in the
range 0 to 2 32 −1. Thus, each shingle is represented by four bytes instead
of nine. Not only has the data been compacted, but we can now manipulate
(hashed) shingles by single-word machine operations.
Notice that we can differentiate documents better if we use 9-shingles and
hash them down to four bytes than to use 4-shingles, even though the space used
to represent a shingle is the same. The reason was touched upon in Section 3.2.2.
If we use 4-shingles, most sequences of four bytes are unlikely or impossible to
find in typical documents. Thus, the effective number of different shingles is
much less than 2 32 −1. If, as in Section 3.2.2, we assume only 20 characters are
frequent in English text, then the number of different 4-shingles that are likely
to occur is only (20) 4 = 160,000. However, if we use 9-shingles, there are many
more than 2 32 likely shingles. When we hash them down to four bytes, we can
expect almost any sequence of four bytes to be possible, as was discussed in
Section 1.3.2.
3.2.4
Shingles Built from Words
An alternative form of shingle has proved effective for the problem of identifying
similar news articles, mentioned in Section 3.1.2. The exploitable distinction for
this problem is that the news articles are written in a rather different style than
are other elements that typically appear on the page with the article. News
articles, and most prose, have a lot of stop words (see Section 1.3.1), the most
common words such as “and,” “you,” “to,” and so on. In many applications,
we want to ignore stop words, since they don't tell us anything useful about
the article, such as its topic.
However, for the problem of finding similar news articles, it was found that
defining a shingle to be a stop word followed by the next two words, regardless
of whether or not they were stop words, formed a useful set of shingles. The
advantage of this approach is that the news article would then contribute more
shingles to the set representing the Web page than would the surrounding ele-
ments. Recall that the goal of the exercise is to find pages that had the same
articles, regardless of the surrounding elements. By biasing the set of shingles
in favor of the article, pages with the same article and different surrounding
material have higher Jaccard similarity than pages with the same surrounding
material but with a different article.
Example 3.5 : An ad might have the simple text “ Buy Sudzo .” However, a
news article with the same idea might read something like “A spokesperson
for the Sudzo Corporation revealed today that studies have shown it is
good for people to buy Sudzo products .” Here, we have italicized all the
likely stop words, although there is no set number of the most frequent words
that should be considered stop words. The first three shingles made from a
stop word and the next two following are:
Search WWH ::




Custom Search