Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

EXERCISE 3.2.2 If we use the stop-word-based shingles of Section 3.2.4 , and we take the

stop words to be all the words of three or fewer letters, then what are the shingles in the

first sentence of Section 3.2 ?

EXERCISE 3.2.3 What is the largest number of k -shingles a document of n bytes can have?

You may assume that the size of the alphabet is large enough that the number of possible

strings of length k is at least as n .

3.3 Similarity-Preserving Summaries of Sets

Sets of shingles are large. Even if we hash them to four bytes each, the space needed to

store a set is still roughly four times the space taken by the document. If we have millions

of documents, it may well not be possible to store all the shingle-sets in main memory. 3

Our goal in this section is to replace large sets by much smaller representations called

“signatures.” The important property we need for signatures is that we can compare the

signatures of two sets and estimate the Jaccard similarity of the underlying sets from the

signatures alone. It is not possible that the signatures give the exact similarity of the sets

they represent, but the estimates they provide are close, and the larger the signatures the

more accurate the estimates. For example, if we replace the 200,000-byte hashed-shingle

sets that derive from 50,000-byte documents by signatures of 1000 bytes, we can usually

get within a few percent.

3.3.1

Matrix Representation of Sets

Before explaining how it is possible to construct small signatures from large sets, it is help-

ful to visualize a collection of sets as their characteristic matrix . The columns of the matrix

correspond to the sets, and the rows correspond to elements of the universal set from which

elements of the sets are drawn. There is a 1 in row r and column c if the element for row r

is a member of the set for column c . Otherwise the value in position ( r, c ) is 0.

EXAMPLE 3.6 In Fig. 3.2 is an example of a matrix representing sets chosen from the uni-

versal set { a, b, c, d, e }. Here, S 1 = { a, d }, S 2 = { c }, S 3 = { b, d, e }, and S 4 = { a, c, d }. The

top row and leftmost columns are not part of the matrix, but are present only to remind us

what the rows and columns represent.

□

Search WWH ::

Custom Search

Home