Recommendation Systems - Mining of Massive Datasets

Databases Reference

In-Depth Information

which a user will like an item by the closeness of the item's profile to the

user's profile.

F

Classification of Items: An alternative to constructing a user profile is to

build a classifier for each user, e.g., a decision tree. The row of the utility

matrix for that user becomes the training data, and the classifier must

predict the response of the user to all items, whether or not the row had

an entry for that item.

F

Similarity of Rows and Columns of the Utility Matrix : Collaborative fil-

tering algorithms must measure the similarity of rows and/or columns

of the utility matrix. Jaccard distance is appropriate when the matrix

consists only of 1's and blanks (for “not rated”). Cosine distance works

for more general values in the utility matrix. It is often useful to normal-

ize the utility matrix by subtracting the average value (either by row, by

column, or both) before measuring the cosine distance.

F

Clustering Users and Items: Since the utility matrix tends to be mostly

blanks, distance measures such as Jaccard or cosine often have too little

data with which to compare two rows or two columns. A preliminary

step or steps, in which similarity is used to cluster users and/or items

into small groups with strong similarity, can help provide more common

components with which to compare rows or columns.

F

UV-Decomposition: One way of predicting the blank values in a utility

matrix is to find two long, thin matrices U and V , whose product is an

approximation to the given utility matrix. Since the matrix product U V

gives values for all user-item pairs, that value can be used to predict the

value of a blank in the utility matrix. The intuitive reason this method

makes sense is that often there are a relatively small number of issues (that

number is the “thin” dimension of U and V ) that determine whether or

not a user likes an item.

F

Root-Mean-Square Error : A good measure of how close the product U V

is to the given utility matrix is the RMSE (root-mean-square error). The

RMSE is computed by averaging the square of the differences between

U V and the utility matrix, in those elements where the utility matrix is

nonblank. The square root of this average is the RMSE.

F

Computing U and V : One way of finding a good choice for U and V in a

UV-decomposition is to start with arbitrary matrices U and V . Repeat-

edly adjust one of the elements of U or V to minimize the RMSE between

the product U V and the given utility matrix. The process converges to

a local optimum, although to have a good chance of obtaining a global

optimum we must either repeat the process from many starting matrices,

or search from the starting point in many different ways.

Mining of Massive Datasets

Search WWH ::

Custom Search

Home