Recommendation Systems - Mining of Massive Datasets

was cited as an example. While one cannot tease out of the data information about

how long the delay between viewing and rating was, it is generally safe to assume

that most people see a movie shortly after it comes out. Thus, one can examine the

ratings of any movie to see if its ratings have an upward or downward slope with

time.
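One way to check for such a slope is an ordinary least-squares fit of rating against time. The sketch below is only an illustration: the function name `rating_slope` and the sample data are hypothetical, and it assumes each rating comes with a numeric timestamp such as days since release.

```python
def rating_slope(times, ratings):
    """Least-squares slope of ratings against time (rating units per time unit)."""
    n = len(times)
    if n < 2:
        raise ValueError("need at least two ratings")
    mean_t = sum(times) / n
    mean_r = sum(ratings) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("all ratings share one timestamp")
    cov = sum((t - mean_t) * (r - mean_r) for t, r in zip(times, ratings))
    return cov / var_t


# Hypothetical data: days since release and the star rating given on that day.
days = [0, 10, 20, 30]
stars = [5, 4, 4, 3]
slope = rating_slope(days, stars)  # negative here: ratings drift downward
```

A negative slope suggests the ratings of this movie decline with time; a positive slope suggests they rise.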

9.6 Summary of Chapter 9

✦ Utility Matrices : Recommendation systems deal with users and items. A utility matrix offers known information

about the degree to which a user likes an item. Normally, most entries are unknown, and the essential problem of

recommending items to users is predicting the values of the unknown entries based on the values of the known entries.

✦ Two Classes of Recommendation Systems : These systems attempt to predict a user's response to an item by

discovering similar items and the response of the user to those. One class of recommendation system is content-based; it

measures similarity by looking for common features of the items. A second class of recommendation system uses

collaborative filtering; such systems measure similarity of users by their item preferences and/or measure similarity of items

by the users who like them.

✦ Item Profiles : These consist of features of items. Different kinds of items have different features on which

content-based similarity can be based. Features of documents are typically important or unusual words. Products have

attributes such as screen size for a television. Media such as movies have a genre and details such as actor or performer.

Tags can also be used as features if they can be acquired from interested users.

✦ User Profiles : A content-based system can construct profiles for users by measuring the

frequency with which features appear in the items the user likes. We can then estimate the degree to which a user will

like an item by the closeness of the item's profile to the user's profile.
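A minimal sketch of the user-profile idea, assuming items are described by sets of boolean features and that "closeness" is measured by the cosine of the angle between the two profiles. The feature names, the sample items, and the helper names (`user_profile`, `item_vector`, `cosine_similarity`) are all hypothetical.

```python
import math


def user_profile(liked_items, features):
    """Fraction of the user's liked items in which each feature appears."""
    n = len(liked_items)
    return [sum(1 for item in liked_items if f in item) / n for f in features]


def item_vector(item, features):
    """0/1 profile of an item over the same feature list."""
    return [1.0 if f in item else 0.0 for f in features]


def cosine_similarity(x, y):
    """Cosine of the angle between x and y; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0
    return dot / (norm_x * norm_y)


# Hypothetical features and liked items (each item is a set of its features).
features = ["action", "comedy", "actor_a"]
liked = [
    {"action", "actor_a"},
    {"action"},
    {"comedy", "actor_a"},
    {"action", "actor_a"},
]
profile = user_profile(liked, features)

# Score two candidate items against the profile; higher means closer.
score_1 = cosine_similarity(profile, item_vector({"action", "actor_a"}, features))
score_2 = cosine_similarity(profile, item_vector({"comedy"}, features))
```

Under these assumptions the first candidate scores higher, because it shares the features that appear most often in the items the user likes.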

✦ Classification of Items : An alternative to constructing a user profile is to build a classifier for each user, e.g., a

decision tree. The row of the utility matrix for that user becomes the training data, and the classifier must predict the

response of the user to all items, whether or not the row had an entry for that item.

✦ Similarity of Rows and Columns of the Utility Matrix : Collaborative filtering algorithms must measure the similarity

of rows and/or columns of the utility matrix. Jaccard distance is appropriate when the matrix consists only of 1s and

blanks (for “not rated”). Cosine distance works for more general values in the utility matrix. It is often useful to

normalize the utility matrix by subtracting the average value (either by row, by column, or both) before measuring

the cosine distance.
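The measures above can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: blanks are represented by `None` and treated as 0 when computing the cosine, cosine distance is taken to be the angle between the two vectors (some libraries use 1 minus the cosine instead), and the sample rows are hypothetical.

```python
import math


def jaccard_distance(a, b):
    """a, b: sets of item ids for which the row holds a 1."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def normalize_row(row):
    """Subtract the mean of the known entries; blanks (None) stay blank."""
    known = [v for v in row if v is not None]
    if not known:
        return list(row)
    mean = sum(known) / len(known)
    return [None if v is None else v - mean for v in row]


def cosine_distance(x, y):
    """Angle in radians between two rows, with blanks treated as 0."""
    xs = [0.0 if v is None else v for v in x]
    ys = [0.0 if v is None else v for v in y]
    dot = sum(a * b for a, b in zip(xs, ys))
    norm_x = math.sqrt(sum(a * a for a in xs))
    norm_y = math.sqrt(sum(b * b for b in ys))
    if norm_x == 0 or norm_y == 0:
        return math.pi / 2  # assumption: no information counts as orthogonal
    cosine = max(-1.0, min(1.0, dot / (norm_x * norm_y)))
    return math.acos(cosine)


# Hypothetical rows of a utility matrix with ratings 1-5; None marks a blank.
user_a = [4, None, None, 5, 1, None, None]
user_c = [None, None, None, 2, 4, 5, None]

dist_raw = cosine_distance(user_a, user_c)
dist_norm = cosine_distance(normalize_row(user_a), normalize_row(user_c))
```

With these rows the raw angle is below 90 degrees, while the angle after subtracting row averages is above 90 degrees. Normalization turns low ratings into negative components, so two users with opposite opinions about the same items end up far apart.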

✦ Clustering Users and Items : Since the utility matrix tends to be mostly blanks, distance measures such as Jaccard

or cosine often have too little data with which to compare two rows or two columns. A preliminary step or steps, in

which similarity is used to cluster users and/or items into small groups with strong similarity, can help provide more

common components with which to compare rows or columns.

✦ UV-Decomposition : One way of predicting the blank values in a utility matrix is to find two long, thin matrices U

and V, whose product is an approximation to the given utility matrix. Since the matrix product UV gives values for

all user-item pairs, that value can be used to predict the value of a blank in the utility matrix. The intuitive reason this

method makes sense is that often there are a relatively small number of issues (that number is the “thin” dimension

of U and V) that determine whether or not a user likes an item.
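To make the prediction step concrete, the sketch below multiplies a small U (users by factors) and V (factors by items) and reads off the entry that corresponds to a blank. The matrices, the helper `mat_mul`, and the choice of two factors are hypothetical values picked for illustration; they are not the result of any fitting procedure.

```python
def mat_mul(u, v):
    """Product of an n-by-d matrix u and a d-by-m matrix v, as nested lists."""
    d = len(v)
    m = len(v[0])
    return [[sum(row[k] * v[k][j] for k in range(d)) for j in range(m)] for row in u]


# Hypothetical utility matrix: 3 users by 3 items, None marks a blank.
M = [
    [3.0, 4.0, None],
    [3.0, None, 3.5],
    [4.0, 4.0, 4.0],
]

# Hypothetical factors with "thin" dimension d = 2.
U = [[1.0, 0.5], [0.5, 1.0], [1.0, 1.0]]  # 3 users x 2 factors
V = [[2.0, 4.0, 1.0], [2.0, 0.0, 3.0]]    # 2 factors x 3 items

P = mat_mul(U, V)

# Every blank of M now has a candidate value in the same position of P.
predictions = {
    (i, j): P[i][j]
    for i in range(len(M))
    for j in range(len(M[0]))
    if M[i][j] is None
}
```

Here U and V together hold 12 numbers while the full matrix has 9 entries; the saving only becomes significant for large matrices, where the number of factors is far smaller than the number of users or items.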

✦ Root-Mean-Square Error : A good measure of how close the product UV is to the given utility matrix is the RMSE

(root-mean-square error). The RMSE is computed by averaging the square of the differences between UV and the

utility matrix, in those elements where the utility matrix is nonblank. The square root of this average is the RMSE.
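The RMSE computation described above can be sketched directly. The sketch assumes blanks are stored as `None` and that the prediction matrix has the same shape as the utility matrix; the function name `rmse` and the sample matrices are hypothetical.

```python
import math


def rmse(m, p):
    """Root-mean-square error between m and p over the nonblank entries of m."""
    squared = [
        (m[i][j] - p[i][j]) ** 2
        for i in range(len(m))
        for j in range(len(m[0]))
        if m[i][j] is not None
    ]
    if not squared:
        raise ValueError("utility matrix has no known entries")
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical 2-by-2 utility matrix with one blank, and a candidate product UV.
M = [[5, None], [3, 1]]
P = [[4.0, 2.0], [3.0, 3.0]]

error = rmse(M, P)  # differences 1, 0, -2 over the three known entries
```

The blank entry contributes nothing to the error, so whatever UV predicts in that position is neither rewarded nor penalized.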

✦ Computing U and V : One way of finding a good choice for U and V in a UV-decomposition is to start with arbitrary

matrices U and V. Repeatedly adjust one of the elements of U or V to minimize the RMSE between the product UV
