Database Reference
In-Depth Information
3.2.1
k -Shingles
A document is a string of characters. Define a k -shingle for a document to be any substring
of length k found within the document. Then, we may associate with each document the set
of k -shingles that appear one or more times within that document.
EXAMPLE 3.3 Suppose our document D is the string abcdabd , and we pick k = 2. Then
the set of 2-shingles for D is { ab , bc , cd , da , bd }.
Note that the substring ab appears twice within D , but appears only once as a shingle.
A variation of shingling produces a bag, rather than a set, so each shingle would appear in
the result as many times as it appears in the document. However, we shall not use bags of
shingles here.
There are several options regarding how white space (blank, tab, newline, etc.) is treated.
It probably makes sense to replace any sequence of one or more whitespace characters by
a single blank. That way, we distinguish shingles that cover two or more words from those
that do not.
EXAMPLE 3.4 If we use k = 9, but eliminate whitespace altogether, then we would see some
lexical similarity in the sentences “ The plane was ready for touch down ”.
and “ The quarterback scored a touchdown ”. However, if we retain the blanks,
then the first has shingles touch dow and ouch down , while the second has touch-
down . If we eliminated the blanks, then both would have touchdown .
3.2.2
Choosing the Shingle Size
We can pick k to be any constant we like. However, if we pick k too small, then we would
expect most sequences of k characters to appear in most documents. If so, then we could
have documents whose shingle-sets had high Jaccard similarity, yet the documents had
none of the same sentences or even phrases. As an extreme example, if we use k = 1, most
Web pages will have most of the common characters and few other characters, so almost
all Web pages will have high similarity.
How large k should be depends on how long typical documents are and how large the set
of typical characters is. The important thing to remember is:
k should be picked large enough that the probability of any given shingle appearing
in any given document is low.
Search WWH ::




Custom Search