Recommendation Systems - Mining of Massive Datasets


9.2.4 Representing Item Profiles

Our ultimate goal for content-based recommendation is to create both an item profile consisting of feature-value pairs and a user profile summarizing the preferences of the user, based on their row of the utility matrix. In Section 9.2.2 we suggested how an item profile could be constructed. We imagined a vector of 0s and 1s, where a 1 represented the occurrence of a high-TF.IDF word in the document. Since features for documents were all words, it was easy to represent profiles this way.

We shall try to generalize this vector approach to all sorts of features. It is easy to do so for features that are sets of discrete values. For example, if one feature of movies is the set of actors, then imagine that there is a component for each actor, with 1 if the actor is in the movie, and 0 if not. Likewise, we can have a component for each possible director, and each possible genre. All these features can be represented using only 0s and 1s.
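
The construction above can be sketched in a few lines of Python. This is only an illustration under assumed data: the helper name `boolean_profile`, the actor and genre lists, and the sample movie are all hypothetical and do not come from the text.

```python
def boolean_profile(item_values, vocabulary):
    """Return a list of 0s and 1s with one component per entry of `vocabulary`.

    A component is 1 if the corresponding value occurs in `item_values`.
    """
    present = set(item_values)
    return [1 if value in present else 0 for value in vocabulary]


# Hypothetical vocabularies: one component per known actor and per known genre.
actors = ["Actor A", "Actor B", "Actor C", "Actor D"]
genres = ["comedy", "drama"]

# Hypothetical movie described by two set-valued features.
movie = {"actors": ["Actor B", "Actor D"], "genres": ["drama"]}

# The item profile is the concatenation of the per-feature boolean vectors.
profile = boolean_profile(movie["actors"], actors) + boolean_profile(movie["genres"], genres)
print(profile)  # [0, 1, 0, 1, 0, 1]
```

In practice the vocabulary of actors is very large, so a sparse representation (for example, storing only the positions that hold a 1) would likely be preferable to a dense list.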

There is another class of features that is not readily represented by boolean vectors: those features that are numerical. For instance, we might take the average rating for movies to be a feature, and this average is a real number. It does not make sense to have one component for each of the possible average ratings, and doing so would cause us to lose the structure implicit in numbers. That is, two ratings that are close but not identical should be considered more similar than widely differing ratings. Likewise, numerical features of products, such as screen size or disk capacity for PCs, should be considered similar if their values do not differ greatly.
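
A small sketch may make the loss of structure concrete. It assumes three hypothetical average ratings and a hypothetical helper `one_hot`; the point is only that, with one component per possible rating, every pair of distinct ratings shares no 1s and so looks equally unrelated.

```python
def one_hot(value, possible_values):
    """One component per possible value; 1 only where the value matches."""
    return [1 if value == p else 0 for p in possible_values]


def overlap(u, v):
    """Dot product of two 0/1 vectors, i.e. the number of shared 1s."""
    return sum(a * b for a, b in zip(u, v))


# Hypothetical average ratings of three movies.
ratings = [3.0, 3.1, 5.0]
a, b, c = (one_hot(r, ratings) for r in ratings)

# With one component per rating, 3.0 vs 3.1 looks exactly as unrelated
# as 3.0 vs 5.0: both pairs have no component in common.
print(overlap(a, b), overlap(a, c))  # 0 0

# Keeping the rating as a single number preserves the ordering of distances.
print(abs(3.0 - 3.1) < abs(3.0 - 5.0))  # True
```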

Numerical features should be represented by single components of vectors representing items. These components hold the exact value of that feature. There is no harm if some components of the vectors are boolean and others are real-valued or integer-valued. We can still compute the cosine distance between vectors, although if we do so, we should give some thought to the appropriate scaling of the nonboolean components, so that they neither dominate the calculation nor become irrelevant.
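
To see why scaling matters, the following sketch computes the cosine of the angle between two mixed vectors for several values of a scaling factor. Everything here is an assumption made for illustration: the function names, the parameter `alpha`, the boolean components, and the numeric values 2 and 5 are hypothetical and not taken from the text.

```python
import math


def cosine(u, v):
    """Cosine of the angle between vectors u and v (assumed nonzero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def item_vector(boolean_part, numeric_value, alpha):
    """Append the numeric feature, multiplied by the scaling factor alpha."""
    return list(boolean_part) + [alpha * numeric_value]


def cosine_for_alpha(alpha):
    # Hypothetical items: three boolean features each, one of them shared,
    # plus one numeric feature with values 2 and 5.
    x = item_vector([1, 1, 1, 0, 0], 2.0, alpha)
    y = item_vector([0, 0, 1, 1, 1], 5.0, alpha)
    return cosine(x, y)


for alpha in (0.0, 0.5, 1.0, 10.0):
    print(alpha, round(cosine_for_alpha(alpha), 3))
```

With `alpha` equal to 0 the numeric feature is ignored and only the shared boolean component counts; as `alpha` grows, the numeric component dominates and the cosine approaches 1 regardless of the boolean features. A reasonable choice lies somewhere between these extremes.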

EXAMPLE 9.2 Suppose the only features of movies are the set of actors and the average rating. Consider two movies with five actors each. Two of the actors are in both movies. Also, one movie has an average rating of 3 and the other an average of 4. The vectors look something like

0 1 1 0 1 1 0 1 3α

1 1 0 1 0 1 1 0 4α

However, there are in principle an infinite number of additional components, each with 0s for both vectors, representing all the possible actors that neither movie has. Since cosine
