Database Reference
In-Depth Information
Thus, if our corpus of documents is emails, picking
k
= 5 should be fine. To see why,
suppose that only letters and a general white-space character appear in emails (although in
practice, most of the printable ASCII characters can be expected to appear occasionally). If
so, then there would be 27
5
= 14,348,907 possible shingles. Since the typical email is much
smaller than 14 million characters long, we would expect
k
= 5 to work well, and indeed it
does.
However, the calculation is a bit more subtle. Surely, more than 27 characters appear in
emails, However, all characters do not appear with equal probability. Common letters and
blanks dominate, while āzā and other letters that have high point-value in Scrabble are rare.
Thus, even short emails will have many 5-shingles consisting of common letters, and the
chances of unrelated emails sharing these common shingles is greater than would be im-
plied by the calculation in the paragraph above. A good rule of thumb is to imagine that
there are only 20 characters and estimate the number of
k
-shingles as 20
k
. For large docu-
ments, such as research articles, choice
k
= 9 is considered safe.
3.2.3
Hashing Shingles
Instead of using substrings directly as shingles, we can pick a hash function that maps
strings of length
k
to some number of buckets and treat the resulting bucket number as the
shingle. The set representing a document is then the set of integers that are bucket numbers
of one or more
k
-shingles that appear in the document. For instance, we could construct the
set of 9-shingles for a document and then map each of those 9-shingles to a bucket number
in the range 0 to 2
32
ā 1. Thus, each shingle is represented by four bytes instead of nine.
Not only has the data been compacted, but we can now manipulate (hashed) shingles by
single-word machine operations.
Notice that we can differentiate documents better if we use 9-shingles and hash them
down to four bytes than to use 4-shingles, even though the space used to represent a shingle
is the same. The reason was touched upon in
Section 3.2.2
. If we use 4-shingles, most se-
quences of four bytes are unlikely or impossible to find in typical documents. Thus, the
we assume only 20 characters are frequent in English text, then the number of different
4-shingles that are likely to occur is only (20)
4
= 160,000. However, if we use 9-shingles,
there are many more than 2
32
likely shingles. When we hash them down to four bytes, we
can expect almost any sequence of four bytes to be possible, as was discussed in
Section