9.2.4 Representing Item Profiles
Our ultimate goal for content-based recommendation is to create both an item profile
consisting of feature-value pairs and a user profile summarizing the preferences of the user,
based on their row of the utility matrix. In Section 9.2.2 we suggested how an item
profile could be constructed. We imagined a vector of 0s and 1s, where a 1 represented the
occurrence of a high-TF.IDF word in the document. Since features for documents were all
words, it was easy to represent profiles this way.
We shall try to generalize this vector approach to all sorts of features. It is easy to do so
for features that are sets of discrete values. For example, if one feature of movies is the set
of actors, then imagine that there is a component for each actor, with 1 if the actor is in
the movie, and 0 if not. Likewise, we can have a component for each possible director, and
each possible genre. All these features can be represented using only 0s and 1s.
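As a minimal sketch of this encoding, the following Python fragment builds a 0/1 profile for one movie by concatenating one block of components per feature. The actor, director, and genre names, and the helper `boolean_profile`, are hypothetical and chosen only for illustration.

```python
def boolean_profile(item_values, vocabulary):
    """Return one 0/1 component per entry of vocabulary:
    1 if the item has that value, 0 if not."""
    present = set(item_values)
    return [1 if value in present else 0 for value in vocabulary]

# Hypothetical vocabularies: every value a feature can take.
actors = ["Actor A", "Actor B", "Actor C", "Actor D"]
directors = ["Director X", "Director Y"]
genres = ["comedy", "drama", "thriller"]

# Hypothetical movie described by sets of discrete values.
movie = {
    "actors": ["Actor B", "Actor D"],
    "directors": ["Director Y"],
    "genres": ["drama"],
}

# The item profile is the concatenation of the per-feature blocks.
profile = (
    boolean_profile(movie["actors"], actors)
    + boolean_profile(movie["directors"], directors)
    + boolean_profile(movie["genres"], genres)
)
print(profile)  # [0, 1, 0, 1, 0, 1, 0, 1, 0]
```

In practice the vocabularies are very large and each profile is mostly 0s, so a sparse representation (for example, the set of positions holding a 1) would usually be preferred over an explicit list.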
There is another class of features that is not readily represented by boolean vectors: those
features that are numerical. For instance, we might take the average rating for movies to
be a feature, and this average is a real number. It does not make sense to have one
component for each of the possible average ratings, and doing so would cause us to lose the
structure implicit in numbers. That is, two ratings that are close but not identical should
be considered more similar than widely differing ratings. Likewise, numerical features of
products, such as screen size or disk capacity for PCs, should be considered similar if their
values do not differ greatly.
Numerical features should be represented by single components of vectors representing
items. These components hold the exact value of that feature. There is no harm if some
components of the vectors are boolean and others are real-valued or integer-valued. We can
still compute the cosine distance between vectors, although if we do so, we should give
some thought to the appropriate scaling of the nonboolean components, so that they neither
dominate the calculation nor become irrelevant.
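To see why the scaling matters, here is a small sketch in Python that appends one scaled numerical component to a boolean vector and computes the cosine of the angle between two such vectors. A larger cosine means a smaller angle, so the items count as more similar. The flag vectors, the ratings, the scale factor `alpha`, and the helper names are assumptions made for illustration.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def item_vector(boolean_part, numeric_value, alpha):
    """Boolean components followed by one numerical component scaled by alpha."""
    return list(boolean_part) + [alpha * numeric_value]

# Hypothetical items whose boolean features do not overlap at all,
# but whose numerical feature (say, average rating) is similar.
flags_1, rating_1 = [1, 1, 0, 0], 3.0
flags_2, rating_2 = [0, 0, 1, 1], 4.0

results = {}
for alpha in (0.0, 1.0, 100.0):
    v1 = item_vector(flags_1, rating_1, alpha)
    v2 = item_vector(flags_2, rating_2, alpha)
    results[alpha] = cosine(v1, v2)
    print(alpha, results[alpha])
```

With `alpha` equal to 0 the numerical component is irrelevant and the cosine is 0, because the boolean parts share nothing. With a very large `alpha` the cosine approaches 1, because the numerical component dominates and the boolean parts barely register. A useful scale lies between these extremes and depends on how much weight the numerical feature should carry.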
EXAMPLE 9.2 Suppose the only features of movies are the set of actors and the average
rating. Consider two movies with five actors each. Two of the actors are in both movies.
Also, one movie has an average rating of 3 and the other an average of 4. The vectors look
something like
0 1 1 0 1 1 0 1 3α
1 1 0 1 0 1 1 0 4α
However, there are in principle an infinite number of additional components, each with 0s
for both vectors, representing all the possible actors that neither movie has. Since cosine