Database Reference
In-Depth Information
3.2.1
k
-Shingles
A document is a string of characters. Define a
k
-shingle for a document to be any substring
of length
k
found within the document. Then, we may associate with each document the set
of
k
-shingles that appear one or more times within that document.
EXAMPLE
3.3 Suppose our document
D
is the string
abcdabd
, and we pick
k
= 2. Then
the set of 2-shingles for
D
is {
ab
,
bc
,
cd
,
da
,
bd
}.
Note that the substring
ab
appears twice within
D
, but appears only once as a shingle.
A variation of shingling produces a bag, rather than a set, so each shingle would appear in
the result as many times as it appears in the document. However, we shall not use bags of
shingles here.
□
There are several options regarding how white space (blank, tab, newline, etc.) is treated.
It probably makes sense to replace any sequence of one or more whitespace characters by
a single blank. That way, we distinguish shingles that cover two or more words from those
that do not.
EXAMPLE
3.4 If we use
k
= 9, but eliminate whitespace altogether, then we would see some
lexical similarity in the sentences “
The plane was ready for touch down
”.
and “
The quarterback scored a touchdown
”. However, if we retain the blanks,
then the first has shingles
touch dow
and
ouch down
, while the second has
touch-
down
. If we eliminated the blanks, then both would have
touchdown
.
□
3.2.2
Choosing the Shingle Size
We can pick
k
to be any constant we like. However, if we pick
k
too small, then we would
expect most sequences of
k
characters to appear in most documents. If so, then we could
have documents whose shingle-sets had high Jaccard similarity, yet the documents had
none of the same sentences or even phrases. As an extreme example, if we use
k
= 1, most
Web pages will have most of the common characters and few other characters, so almost
all Web pages will have high similarity.
How large
k
should be depends on how long typical documents are and how large the set
of typical characters is. The important thing to remember is:
•
k
should be picked large enough that the probability of any given shingle appearing
in any given document is low.