Database Reference
In-Depth Information
3.9.1
Finding Identical Items
The extreme case is finding identical items, for example, Web pages that are identical,
character-for-character. It is straightforward to compare two documents and tell whether
they are identical, but we still must avoid having to compare every pair of documents. Our
first thought would be to hash documents based on their first few characters, and compare
only those documents that fell into the same bucket. That scheme should work well, unless
all the documents begin with the same characters, such as an HTML header.
Our second thought would be to use a hash function that examines the entire document.
That would work, and if we use enough buckets, it would be very rare that two documents
went into the same bucket, yet were not identical. The downside of this approach is that we
must examine every character of every document. If we limit our examination to a small
number of characters, then we never have to examine a document that is unique and falls
into a bucket of its own.
A better approach is to pick some fixed random positions for all documents, and make
the hash function depend only on these. This way, we can avoid a problem where there is a
common prefix for all or most documents, yet we need not examine entire documents un-
less they fall into a bucket with another document. One problem with selecting fixed posi-
tions is that if some documents are short, they may not have some of the selected positions.
However, if we are looking for highly similar documents, we never need to compare two
documents that differ significantly in their length. We exploit this idea in Section 3.9.3 .
3.9.2
Representing Sets as Strings
Now, let us focus on the harder problem of finding, in a large collection of sets, all pairs
that have a high Jaccard similarity, say at least 0.9. We can represent a set by sorting the
elements of the universal set in some fixed order, and representing any set by listing its ele-
ments in this order. The list is essentially a string of “characters,” where the characters are
the elements of the universal set. These strings are unusual, however, in that:
(1) No character appears more than once in a string, and
(2) If two characters appear in two different strings, then they appear in the same order in
both strings.
EXAMPLE 3.24 Suppose the universal set consists of the 26 lower-case letters, and we
use the normal alphabetical order. Then the set { d, a, b } is represented by the string
abd .
Search WWH ::




Custom Search