Database Reference
In-Depth Information
On-Line Purchases
Amazon.com has millions of customers and sells millions of items. Its database records
which items have been bought by which customers. We can say two customers are similar
if their sets of purchased items have a high Jaccard similarity. Likewise, two items that have
sets of purchasers with high Jaccard similarity will be deemed similar. Note that, while we
might expect mirror sites to have Jaccard similarity above 90%, it is unlikely that any two
customers have Jaccard similarity that high (unless they have purchased only one item).
Even a Jaccard similarity like 20% might be unusual enough to identify customers with
similar tastes. The same observation holds for items; Jaccard similarities need not be very
high to be significant.
Collaborative filtering requires several tools, in addition to finding similar customers or
items, as we discuss in Chapter 9 . For example, two Amazon customers who like science-
fiction might each buy many science-fiction topics, but only a few of these will be in com-
mon. However, by combining similarity-finding with clustering ( Chapter 7 ) , we might be
able to discover that science-fiction topics are mutually similar and put them in one group.
Then, we can get a more powerful notion of customer-similarity by asking whether they
made purchases within many of the same groups.
Movie Ratings
NetFlix records which movies each of its customers rented, and also the ratings assigned to
those movies by the customers. We can see movies as similar if they were rented or rated
highly by many of the same customers, and see customers as similar if they rented or rated
highly many of the same movies. The same observations that we made for Amazon above
apply in this situation: similarities need not be high to be significant, and clustering movies
by genre will make things easier.
When our data consists of ratings rather than binary decisions (bought/did not buy or
liked/disliked), we cannot rely simply on sets as representations of customers or items.
Some options are:
(1) Ignore low-rated customer/movie pairs; that is, treat these events as if the customer
never watched the movie.
(2) When comparing customers, imagine two set elements for each movie, “liked” and
“hated.” If a customer rated a movie highly, put the “liked” for that movie in the cus-
tomer's set. If they gave a low rating to a movie, put “hated” for that movie in their set.
Then, we can look for high Jaccard similarity among these sets. We can do a similar
trick when comparing movies.
Search WWH ::




Custom Search