Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

3.4 Locality-Sensitive Hashing for Documents

Even though we can use minhashing to compress large documents into small signatures

and preserve the expected similarity of any pair of documents, it still may be impossible to

find the pairs with greatest similarity efficiently. The reason is that the number of pairs of

documents may be too large, even if there are not too many documents.

EXAMPLE 3.9 Suppose we have a million documents, and we use signatures of length 250.

Then we use 1000 bytes per document for the signatures, and the entire data fits in a giga-

byte - less than a typical main memory of a laptop. However, there are or half a

trillion pairs of documents. If it takes a microsecond to compute the similarity of two sig-

natures, then it takes almost six days to compute all the similarities on that laptop.

□

If our goal is to compute the similarity of every pair, there is nothing we can do to reduce

the work, although parallelism can reduce the elapsed time. However, often we want only

the most similar pairs or all pairs that are above some lower bound in similarity. If so, then

we need to focus our attention only on pairs that are likely to be similar, without investigat-

ing every pair. There is a general theory of how to provide such focus, called locality-sens-

itive hashing (LSH) or near - neighbor search . In this section we shall consider a specific

form of LSH, designed for the particular problem we have been studying: documents, rep-

resented by shingle-sets, then minhashed to short signatures. In Section 3.6 we present the

general theory of locality-sensitive hashing and a number of applications and related tech-

niques.

3.4.1

LSH for Minhash Signatures

One general approach to LSH is to “hash” items several times, in such a way that similar

items are more likely to be hashed to the same bucket than dissimilar items are. We then

consider any pair that hashed to the same bucket for any of the hashings to be a candidate

pair . We check only the candidate pairs for similarity. The hope is that most of the dissim-

ilar pairs will never hash to the same bucket, and therefore will never be checked. Those

dissimilar pairs that do hash to the same bucket are false positives ; we hope these will be

only a small fraction of all pairs. We also hope that most of the truly similar pairs will hash

to the same bucket under at least one of the hash functions. Those that do not are false neg-

atives ; we hope these will be only a small fraction of the truly similar pairs.

If we have minhash signatures for the items, an effective way to choose the hashings is

to divide the signature matrix into b bands consisting of r rows each. For each band, there is

a hash function that takes vectors of r integers (the portion of one column within that band)

Search WWH ::

Custom Search

Home