Finding Similar Items - Mining of Massive Datasets

Database Reference

In-Depth Information

On-Line Purchases

Amazon.com has millions of customers and sells millions of items. Its database records

which items have been bought by which customers. We can say two customers are similar

if their sets of purchased items have a high Jaccard similarity. Likewise, two items that have

sets of purchasers with high Jaccard similarity will be deemed similar. Note that, while we

might expect mirror sites to have Jaccard similarity above 90%, it is unlikely that any two

customers have Jaccard similarity that high (unless they have purchased only one item).

Even a Jaccard similarity like 20% might be unusual enough to identify customers with

similar tastes. The same observation holds for items; Jaccard similarities need not be very

high to be significant.

Collaborative filtering requires several tools, in addition to finding similar customers or

items, as we discuss in Chapter 9 . For example, two Amazon customers who like science-

fiction might each buy many science-fiction topics, but only a few of these will be in com-

mon. However, by combining similarity-finding with clustering ( Chapter 7 ) , we might be

able to discover that science-fiction topics are mutually similar and put them in one group.

Then, we can get a more powerful notion of customer-similarity by asking whether they

made purchases within many of the same groups.

Movie Ratings

NetFlix records which movies each of its customers rented, and also the ratings assigned to

those movies by the customers. We can see movies as similar if they were rented or rated

highly by many of the same customers, and see customers as similar if they rented or rated

highly many of the same movies. The same observations that we made for Amazon above

apply in this situation: similarities need not be high to be significant, and clustering movies

by genre will make things easier.

When our data consists of ratings rather than binary decisions (bought/did not buy or

liked/disliked), we cannot rely simply on sets as representations of customers or items.

Some options are:

(1) Ignore low-rated customer/movie pairs; that is, treat these events as if the customer

never watched the movie.

(2) When comparing customers, imagine two set elements for each movie, “liked” and

“hated.” If a customer rated a movie highly, put the “liked” for that movie in the cus-

tomer's set. If they gave a low rating to a movie, put “hated” for that movie in their set.

Then, we can look for high Jaccard similarity among these sets. We can do a similar

trick when comparing movies.

Search WWH ::

Custom Search

Home