Database Reference
In-Depth Information
(3) If ratings are 1-to-5-stars, put a movie in a customer's set n times if they rated the
movie n -stars. Then, use Jaccard similarity for bags when measuring the similarity of
customers. The Jaccard similarity for bags B and C is defined by counting an element
n times in the intersection if n is the minimum of the number of times the element ap-
pears in B and C . In the union, we count the element the sum of the number of times it
appears in B and in C . 2
EXAMPLE 3.2 The bag-similarity of bags { a, a, a, b } and { a, a, b, b, c } is 1/3. The intersec-
tion counts a twice and b once, so its size is 3. The size of the union of two bags is always
the sum of the sizes of the two bags, or 9 in this case. Since the highest possible Jaccard
similarity for bags is 1/2, the score of 1/3 indicates the two bags are quite similar, as should
be apparent from an examination of their contents.
3.1.4
Exercises for Section 3.1
EXERCISE 3.1.1 Compute the Jaccard similarities of each pair of the following three sets:
{1 , 2 , 3 , 4}, {2 , 3 , 5 , 7}, and {2 , 4 , 6}.
EXERCISE 3.1.2 Compute the Jaccard bag similarity of each pair of the following three
bags: {1 , 1 , 1 , 2}, {1 , 1 , 2 , 2 , 3}, and {1 , 2 , 3 , 4}.
!! EXERCISE 3.1.3 Suppose we have a universal set U of n elements, and we choose two
subsets S and T at random, each with m of the n elements. What is the expected value of
the Jaccard similarity of S and T ?
3.2 Shingling of Documents
The most effective way to represent documents as sets, for the purpose of identifying lex-
ically similar documents is to construct from the document the set of short strings that ap-
pear within it. If we do so, then documents that share pieces as short as sentences or even
phrases will have many common elements in their sets, even if those sentences appear in
different orders in the two documents. In this section, we introduce the simplest and most
common approach, shingling, as well as an interesting variation.
Search WWH ::




Custom Search