Finding Similar Items - Mining of Massive Datasets

Databases Reference

In-Depth Information

set of integers that are bucket numbers of one or more k-shingles that appear

in the document. For instance, we could construct the set of 9-shingles for a

document and then map each of those 9-shingles to a bucket number in the

range 0 to 2 32 −1. Thus, each shingle is represented by four bytes instead

of nine. Not only has the data been compacted, but we can now manipulate

(hashed) shingles by single-word machine operations.

Notice that we can differentiate documents better if we use 9-shingles and

hash them down to four bytes than to use 4-shingles, even though the space used

to represent a shingle is the same. The reason was touched upon in Section 3.2.2.

If we use 4-shingles, most sequences of four bytes are unlikely or impossible to

find in typical documents. Thus, the effective number of different shingles is

much less than 2 32 −1. If, as in Section 3.2.2, we assume only 20 characters are

frequent in English text, then the number of different 4-shingles that are likely

to occur is only (20) 4 = 160,000. However, if we use 9-shingles, there are many

more than 2 32 likely shingles. When we hash them down to four bytes, we can

expect almost any sequence of four bytes to be possible, as was discussed in

Section 1.3.2.

3.2.4

Shingles Built from Words

An alternative form of shingle has proved effective for the problem of identifying

similar news articles, mentioned in Section 3.1.2. The exploitable distinction for

this problem is that the news articles are written in a rather different style than

are other elements that typically appear on the page with the article. News

articles, and most prose, have a lot of stop words (see Section 1.3.1), the most

common words such as “and,” “you,” “to,” and so on. In many applications,

we want to ignore stop words, since they don't tell us anything useful about

the article, such as its topic.

However, for the problem of finding similar news articles, it was found that

defining a shingle to be a stop word followed by the next two words, regardless

of whether or not they were stop words, formed a useful set of shingles. The

advantage of this approach is that the news article would then contribute more

shingles to the set representing the Web page than would the surrounding ele-

ments. Recall that the goal of the exercise is to find pages that had the same

articles, regardless of the surrounding elements. By biasing the set of shingles

in favor of the article, pages with the same article and different surrounding

material have higher Jaccard similarity than pages with the same surrounding

material but with a different article.

Example 3.5 : An ad might have the simple text “ Buy Sudzo .” However, a

news article with the same idea might read something like “A spokesperson

for the Sudzo Corporation revealed today that studies have shown it is

good for people to buy Sudzo products .” Here, we have italicized all the

likely stop words, although there is no set number of the most frequent words

that should be considered stop words. The first three shingles made from a

stop word and the next two following are:

Mining of Massive Datasets

Search WWH ::

Custom Search

Home