Database Reference
In-Depth Information
Figure 3.3 A permutation of the rows of Fig. 3.2
3.3.3
Minhashing and Jaccard Similarity
There is a remarkable connection between minhashing and Jaccard similarity of the sets
that are minhashed.
• The probability that the minhash function for a random permutation of rows pro-
duces the same value for two sets equals the Jaccard similarity of those sets.
To see why, we need to picture the columns for those two sets. If we restrict ourselves to
the columns for sets S 1 and S 2 , then rows can be divided into three classes:
(1) Type X rows have 1 in both columns.
(2) Type Y rows have 1 in one of the columns and 0 in the other.
(3) Type Z rows have 0 in both columns.
Since the matrix is sparse, most rows are of type Z . However, it is the ratio of the num-
bers of type X and type Y rows that determine both SIM( S 1 , S 2 ) and the probability that
h ( S 1 ) = h ( S 2 ). Let there be x rows of type X and y rows of type Y . Then SIM( S 1 , S 2 ) = x /( x
+ y ). The reason is that x is the size of S 1 S 2 and x + y is the size of S 1 S 2 .
Now, consider the probability that h ( S 1 ) = h ( S 2 ). If we imagine the rows permuted ran-
domly, and we proceed from the top, the probability that we shall meet a type X row before
we meet a type Y row is x /( x + y ). But if the first row from the top other than type Z rows is
a type X row, then surely h ( S 1 ) = h ( S 2 ). On the other hand, if the first row other than a type
Z row that we meet is a type Y row, then the set with a 1 gets that row as its minhash value.
However the set with a 0 in that row surely gets some row further down the permuted list.
Thus, we know h ( S 1 ) ≠ h ( S 2 ) if we first meet a type Y row. We conclude the probability that
h ( S 1 ) = h ( S 2 ) is x /( x + y ), which is also the Jaccard similarity of S 1 and S 2 .
3.3.4
Minhash Signatures
Again think of a collection of sets represented by their characteristic matrix M . To represent
sets, we pick at random some number n of permutations of the rows of M . Perhaps 100
permutations or several hundred permutations will do. Call the minhash functions determ-
ined by these permutations h 1 , h 2 , . . . , h n . From the column representing set S , construct
the minhash signature for S , the vector [ h 1 ( S ) , h 2 ( S ) , . . . , h n ( S )]. We normally represent
this list of hash-values as a column. Thus, we can form from matrix M a signature matrix ,
Search WWH ::




Custom Search