Database Reference
In-Depth Information
Thus, if our corpus of documents is emails, picking k = 5 should be fine. To see why,
suppose that only letters and a general white-space character appear in emails (although in
practice, most of the printable ASCII characters can be expected to appear occasionally). If
so, then there would be 27 5 = 14,348,907 possible shingles. Since the typical email is much
smaller than 14 million characters long, we would expect k = 5 to work well, and indeed it
does.
However, the calculation is a bit more subtle. Surely, more than 27 characters appear in
emails, However, all characters do not appear with equal probability. Common letters and
blanks dominate, while ā€zā€ and other letters that have high point-value in Scrabble are rare.
Thus, even short emails will have many 5-shingles consisting of common letters, and the
chances of unrelated emails sharing these common shingles is greater than would be im-
plied by the calculation in the paragraph above. A good rule of thumb is to imagine that
there are only 20 characters and estimate the number of k -shingles as 20 k . For large docu-
ments, such as research articles, choice k = 9 is considered safe.
3.2.3
Hashing Shingles
Instead of using substrings directly as shingles, we can pick a hash function that maps
strings of length k to some number of buckets and treat the resulting bucket number as the
shingle. The set representing a document is then the set of integers that are bucket numbers
of one or more k -shingles that appear in the document. For instance, we could construct the
set of 9-shingles for a document and then map each of those 9-shingles to a bucket number
in the range 0 to 2 32 āˆ’ 1. Thus, each shingle is represented by four bytes instead of nine.
Not only has the data been compacted, but we can now manipulate (hashed) shingles by
single-word machine operations.
Notice that we can differentiate documents better if we use 9-shingles and hash them
down to four bytes than to use 4-shingles, even though the space used to represent a shingle
is the same. The reason was touched upon in Section 3.2.2 . If we use 4-shingles, most se-
quences of four bytes are unlikely or impossible to find in typical documents. Thus, the
effective number of different shingles is much less than 2 32 āˆ’ 1. If, as in Section 3.2.2 ,
we assume only 20 characters are frequent in English text, then the number of different
4-shingles that are likely to occur is only (20) 4 = 160,000. However, if we use 9-shingles,
there are many more than 2 32 likely shingles. When we hash them down to four bytes, we
can expect almost any sequence of four bytes to be possible, as was discussed in Section
1.3.2 .
Search WWH ::




Custom Search