Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

Thus, if our corpus of documents is emails, picking k = 5 should be fine. To see why,

suppose that only letters and a general white-space character appear in emails (although in

practice, most of the printable ASCII characters can be expected to appear occasionally). If

so, then there would be 27 5 = 14,348,907 possible shingles. Since the typical email is much

smaller than 14 million characters long, we would expect k = 5 to work well, and indeed it

does.

However, the calculation is a bit more subtle. Surely, more than 27 characters appear in

emails, However, all characters do not appear with equal probability. Common letters and

blanks dominate, while ”z” and other letters that have high point-value in Scrabble are rare.

Thus, even short emails will have many 5-shingles consisting of common letters, and the

chances of unrelated emails sharing these common shingles is greater than would be im-

plied by the calculation in the paragraph above. A good rule of thumb is to imagine that

there are only 20 characters and estimate the number of k -shingles as 20 k . For large docu-

ments, such as research articles, choice k = 9 is considered safe.

3.2.3

Hashing Shingles

Instead of using substrings directly as shingles, we can pick a hash function that maps

strings of length k to some number of buckets and treat the resulting bucket number as the

shingle. The set representing a document is then the set of integers that are bucket numbers

of one or more k -shingles that appear in the document. For instance, we could construct the

set of 9-shingles for a document and then map each of those 9-shingles to a bucket number

in the range 0 to 2 32 − 1. Thus, each shingle is represented by four bytes instead of nine.

Not only has the data been compacted, but we can now manipulate (hashed) shingles by

single-word machine operations.

Notice that we can differentiate documents better if we use 9-shingles and hash them

down to four bytes than to use 4-shingles, even though the space used to represent a shingle

is the same. The reason was touched upon in Section 3.2.2 . If we use 4-shingles, most se-

quences of four bytes are unlikely or impossible to find in typical documents. Thus, the

effective number of different shingles is much less than 2 32 − 1. If, as in Section 3.2.2 ,

we assume only 20 characters are frequent in English text, then the number of different

4-shingles that are likely to occur is only (20) 4 = 160,000. However, if we use 9-shingles,

there are many more than 2 32 likely shingles. When we hash them down to four bytes, we

can expect almost any sequence of four bytes to be possible, as was discussed in Section

1.3.2 .

Search WWH ::

Custom Search

Home