Database Reference
In-Depth Information
Figure 3.2 A matrix representing four sets
It is important to remember that the characteristic matrix is unlikely to be the way the
data is stored, but it is useful as a way to visualize the data. For one reason not to store data
as a matrix, these matrices are almost always sparse (they have many more 0's than 1's) in
practice. It saves space to represent a sparse matrix of 0's and 1's by the positions in which
the 1's appear. For another reason, the data is usually stored in some other format for other
purposes.
As an example, if rows are products, and columns are customers, represented by the set
of products they bought, then this data would really appear in a database table of purchases.
A tuple in this table would list the item, the purchaser, and probably other details about the
purchase, such as the date and the credit card used.
3.3.2
Minhashing
The signatures we desire to construct for sets are composed of the results of a large number
of calculations, say several hundred, each of which is a “minhash” of the characteristic
matrix. In this section, we shall learn how a minhash is computed in principle, and in later
sections we shall see how a good approximation to the minhash is computed in practice.
To minhash a set represented by a column of the characteristic matrix, pick a permutation
of the rows. The minhash value of any column is the number of the first row, in the per-
muted order, in which the column has a 1.
EXAMPLE 3.7 Let us suppose we pick the order of rows beadc for the matrix of Fig. 3.2 .
This permutation defines a minhash function h that maps sets to rows. Let us compute the
minhash value of set S 1 according to h . The first column, which is the column for set S 1 ,
has 0 in row b , so we proceed to row e , the second in the permuted order. There is again a
0 in the column for S 1 , so we proceed to row a , where we find a 1. Thus. h ( S 1 ) = a .
Although it is not physically possible to permute very large characteristic matrices, the
minhash function h implicitly reorders the rows of the matrix of Fig. 3.2 so it becomes the
matrix of Fig. 3.3 . In this matrix, we can read off the values of h by scanning from the top
until we come to a 1. Thus, we see that h ( S 2 ) = c , h ( S 3 ) = b , and h ( S 4 ) = a .
Search WWH ::




Custom Search