Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

Figure 3.2 A matrix representing four sets

It is important to remember that the characteristic matrix is unlikely to be the way the

data is stored, but it is useful as a way to visualize the data. For one reason not to store data

as a matrix, these matrices are almost always sparse (they have many more 0's than 1's) in

practice. It saves space to represent a sparse matrix of 0's and 1's by the positions in which

the 1's appear. For another reason, the data is usually stored in some other format for other

purposes.

As an example, if rows are products, and columns are customers, represented by the set

of products they bought, then this data would really appear in a database table of purchases.

A tuple in this table would list the item, the purchaser, and probably other details about the

purchase, such as the date and the credit card used.

3.3.2

Minhashing

The signatures we desire to construct for sets are composed of the results of a large number

of calculations, say several hundred, each of which is a “minhash” of the characteristic

matrix. In this section, we shall learn how a minhash is computed in principle, and in later

sections we shall see how a good approximation to the minhash is computed in practice.

To minhash a set represented by a column of the characteristic matrix, pick a permutation

of the rows. The minhash value of any column is the number of the first row, in the per-

muted order, in which the column has a 1.

EXAMPLE 3.7 Let us suppose we pick the order of rows beadc for the matrix of Fig. 3.2 .

This permutation defines a minhash function h that maps sets to rows. Let us compute the

minhash value of set S 1 according to h . The first column, which is the column for set S 1 ,

has 0 in row b , so we proceed to row e , the second in the permuted order. There is again a

0 in the column for S 1 , so we proceed to row a , where we find a 1. Thus. h ( S 1 ) = a .

Although it is not physically possible to permute very large characteristic matrices, the

minhash function h implicitly reorders the rows of the matrix of Fig. 3.2 so it becomes the

matrix of Fig. 3.3 . In this matrix, we can read off the values of h by scanning from the top

until we come to a 1. Thus, we see that h ( S 2 ) = c , h ( S 3 ) = b , and h ( S 4 ) = a .

□

Search WWH ::

Custom Search

Home