Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

3.9.1

Finding Identical Items

The extreme case is finding identical items, for example, Web pages that are identical,

character-for-character. It is straightforward to compare two documents and tell whether

they are identical, but we still must avoid having to compare every pair of documents. Our

first thought would be to hash documents based on their first few characters, and compare

only those documents that fell into the same bucket. That scheme should work well, unless

all the documents begin with the same characters, such as an HTML header.

Our second thought would be to use a hash function that examines the entire document.

That would work, and if we use enough buckets, it would be very rare that two documents

went into the same bucket, yet were not identical. The downside of this approach is that we

must examine every character of every document. If we limit our examination to a small

number of characters, then we never have to examine a document that is unique and falls

into a bucket of its own.

A better approach is to pick some fixed random positions for all documents, and make

the hash function depend only on these. This way, we can avoid a problem where there is a

common prefix for all or most documents, yet we need not examine entire documents un-

less they fall into a bucket with another document. One problem with selecting fixed posi-

tions is that if some documents are short, they may not have some of the selected positions.

However, if we are looking for highly similar documents, we never need to compare two

documents that differ significantly in their length. We exploit this idea in Section 3.9.3 .

3.9.2

Representing Sets as Strings

Now, let us focus on the harder problem of finding, in a large collection of sets, all pairs

that have a high Jaccard similarity, say at least 0.9. We can represent a set by sorting the

elements of the universal set in some fixed order, and representing any set by listing its ele-

ments in this order. The list is essentially a string of “characters,” where the characters are

the elements of the universal set. These strings are unusual, however, in that:

(1) No character appears more than once in a string, and

(2) If two characters appear in two different strings, then they appear in the same order in

both strings.

EXAMPLE 3.24 Suppose the universal set consists of the 26 lower-case letters, and we

use the normal alphabetical order. Then the set { d, a, b } is represented by the string

abd .

□

Search WWH ::

Custom Search

Home