Databases Reference
In-Depth Information
Element
S 1
S 2
S 3
S 4
a
1
0
0
1
b
0
0
1
0
c
0
1
0
1
d
1
0
1
1
e
0
0
1
0
Figure 3.2: A matrix representing four sets
Example 3.6 : In Fig. 3.2 is an example of a matrix representing sets chosen
from the universal set{a, b, c, d, e}. Here, S 1 ={a, d}, S 2 ={c}, S 3 ={b, d, e},
and S 4 ={a, c, d}. The top row and leftmost columns are not part of the matrix,
but are present only to remind us what the rows and columns represent.
2
It is important to remember that the characteristic matrix is unlikely to be
the way the data is stored, but it is useful as a way to visualize the data. For one
reason not to store data as a matrix, these matrices are almost always sparse
(they have many more 0's than 1's) in practice. It saves space to represent a
sparse matrix of 0's and 1's by the positions in which the 1's appear. For another
reason, the data is usually stored in some other format for other purposes.
As an example, if rows are products, and columns are customers, represented
by the set of products they bought, then this data would really appear in a
database table of purchases. A tuple in this table would list the item, the
purchaser, and probably other details about the purchase, such as the date and
the credit card used.
3.3.2 Minhashing
The signatures we desire to construct for sets are composed of the results of a
large number of calculations, say several hundred, each of which is a “minhash”
of the characteristic matrix. In this section, we shall learn how a minhash is
computed in principle, and in later sections we shall see how a good approxi-
mation to the minhash is computed in practice.
To minhash a set represented by a column of the characteristic matrix, pick
a permutation of the rows. The minhash value of any column is the number of
the first row, in the permuted order, in which the column has a 1.
Example 3.7 : Let us suppose we pick the order of rows beadc for the matrix
of Fig. 3.2. This permutation defines a minhash function h that maps sets to
rows. Let us compute the minhash value of set S 1 according to h. The first
column, which is the column for set S 1 , has 0 in row b, so we proceed to row e,
the second in the permuted order. There is again a 0 in the column for S 1 , so
we proceed to row a, where we find a 1. Thus. h(S 1 ) = a.
Although it is not physically possible to permute very large characteristic
matrices, the minhash function h implicitly reorders the rows of the matrix of
Search WWH ::




Custom Search