Database Reference
In-Depth Information
EXERCISE 3.2.2 If we use the stop-word-based shingles of Section 3.2.4 , and we take the
stop words to be all the words of three or fewer letters, then what are the shingles in the
first sentence of Section 3.2 ?
EXERCISE 3.2.3 What is the largest number of k -shingles a document of n bytes can have?
You may assume that the size of the alphabet is large enough that the number of possible
strings of length k is at least as n .
3.3 Similarity-Preserving Summaries of Sets
Sets of shingles are large. Even if we hash them to four bytes each, the space needed to
store a set is still roughly four times the space taken by the document. If we have millions
of documents, it may well not be possible to store all the shingle-sets in main memory. 3
Our goal in this section is to replace large sets by much smaller representations called
“signatures.” The important property we need for signatures is that we can compare the
signatures of two sets and estimate the Jaccard similarity of the underlying sets from the
signatures alone. It is not possible that the signatures give the exact similarity of the sets
they represent, but the estimates they provide are close, and the larger the signatures the
more accurate the estimates. For example, if we replace the 200,000-byte hashed-shingle
sets that derive from 50,000-byte documents by signatures of 1000 bytes, we can usually
get within a few percent.
3.3.1
Matrix Representation of Sets
Before explaining how it is possible to construct small signatures from large sets, it is help-
ful to visualize a collection of sets as their characteristic matrix . The columns of the matrix
correspond to the sets, and the rows correspond to elements of the universal set from which
elements of the sets are drawn. There is a 1 in row r and column c if the element for row r
is a member of the set for column c . Otherwise the value in position ( r, c ) is 0.
EXAMPLE 3.6 In Fig. 3.2 is an example of a matrix representing sets chosen from the uni-
versal set { a, b, c, d, e }. Here, S 1 = { a, d }, S 2 = { c }, S 3 = { b, d, e }, and S 4 = { a, c, d }. The
top row and leftmost columns are not part of the matrix, but are present only to remind us
what the rows and columns represent.
Search WWH ::




Custom Search