Tags from Computer Games
An interesting direction for encouraging tagging is the “games” approach
pioneered by Luis von Ahn. He enabled two players to collaborate on the
tag for an image. In rounds, they would suggest a tag, and the tags would
be exchanged. If they agreed, then they “won,” and if not, they would
play another round with the same image, trying to agree simultaneously
on a tag. While an innovative direction to try, it is questionable whether
sufficient public interest can be generated to produce enough free work to
satisfy the needs for tagged data.
there are enough tags that occasional erroneous ones will not bias the system
too much.
9.2.4 Representing Item Profiles
Our ultimate goal for content-based recommendation is to create both an item
profile consisting of feature-value pairs and a user profile summarizing the
preferences of the user, based on their row of the utility matrix. In Section 9.2.2
we suggested how an item profile could be constructed. We imagined a vector
of 0's and 1's, where a 1 represented the occurrence of a high-TF.IDF word
in the document. Since features for documents were all words, it was easy to
represent profiles this way.
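The boolean document profile just described can be sketched in a few lines of Python. This is only an illustration: the function name boolean_profile, the tiny vocabulary, and the list of high-TF.IDF words are all made up for the example, and the step that selects the high-TF.IDF words is assumed to have been done already.

```python
def boolean_profile(high_tfidf_words, vocabulary):
    """Return a 0/1 vector with one component per vocabulary word.

    A component is 1 exactly when that word is among the document's
    high-TF.IDF words (assumed to have been selected beforehand).
    """
    chosen = set(high_tfidf_words)
    return [1 if word in chosen else 0 for word in vocabulary]


# Hypothetical vocabulary and document, for illustration only.
vocabulary = ["budget", "election", "goal", "senate", "striker"]
doc_words = ["goal", "striker"]

profile = boolean_profile(doc_words, vocabulary)
print(profile)
```

Every document is encoded against the same vocabulary order, so the profiles of two documents can be compared component by component.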
We shall try to generalize this vector approach to all sorts of features. It is
easy to do so for features that are sets of discrete values. For example, if one
feature of movies is the set of stars, then imagine that there is a component
for each star, with 1 if the star is in the movie, and 0 if not. Likewise, we can
have a component for each possible director, and each possible genre. All these
features can be represented using only 0's and 1's.
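As a minimal sketch of this encoding, the Python below gives each star, director, and genre its own component and concatenates the three groups into one 0/1 vector. The movie records, the placeholder names, and the helpers build_index and encode_movie are hypothetical, chosen only to make the example self-contained.

```python
def build_index(values):
    """Map each distinct value to a component position (sorted for stability)."""
    return {value: i for i, value in enumerate(sorted(set(values)))}


def encode_movie(movie, star_index, director_index, genre_index):
    """Concatenate star, director, and genre components into one 0/1 vector."""
    size = len(star_index) + len(director_index) + len(genre_index)
    vec = [0] * size
    for star in movie["stars"]:
        vec[star_index[star]] = 1
    offset = len(star_index)                      # director block starts here
    vec[offset + director_index[movie["director"]]] = 1
    offset += len(director_index)                 # genre block starts here
    for genre in movie["genres"]:
        vec[offset + genre_index[genre]] = 1
    return vec


# Hypothetical data, for illustration only.
movies = [
    {"stars": ["Star A", "Star B"], "director": "Director X", "genres": ["drama"]},
    {"stars": ["Star B", "Star C"], "director": "Director Y", "genres": ["comedy", "drama"]},
]

star_index = build_index(s for m in movies for s in m["stars"])
director_index = build_index(m["director"] for m in movies)
genre_index = build_index(g for m in movies for g in m["genres"])

vectors = [encode_movie(m, star_index, director_index, genre_index) for m in movies]
for v in vectors:
    print(v)
```

With many stars and directors these vectors become long and mostly zero, so a practical version would probably store only the positions of the 1's; the dense list is used here for readability.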
There is another class of feature that is not readily represented by boolean
vectors: those features that are numerical. For instance, we might take the
average rating for movies to be a feature,² and this average is a real number.
It does not make sense to have one component for each of the possible average
ratings, and doing so would cause us to lose the structure implicit in numbers.
That is, two ratings that are close but not identical should be considered more
similar than widely differing ratings. Likewise, numerical features of products,
such as screen size or disk capacity for PC's, should be considered similar if
their values do not differ greatly.
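The loss of structure can be seen in a small Python sketch. It compares three hypothetical average ratings in two ways: with one boolean component per possible rating, and as plain numbers. The rating values, the helper names, and the use of squared Euclidean distance for the boolean vectors are assumptions made for illustration only.

```python
def one_hot(value, possible_values):
    """One boolean component per possible value; exactly one component is 1."""
    return [1 if value == v else 0 for v in possible_values]


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Hypothetical average ratings: two close together, one far away.
possible = [3.0, 3.1, 5.0]

# With one component per rating, every pair of distinct ratings is equally far apart.
d_close_onehot = squared_distance(one_hot(3.0, possible), one_hot(3.1, possible))
d_far_onehot = squared_distance(one_hot(3.0, possible), one_hot(5.0, possible))

# Treated as numbers, the close ratings really are closer.
d_close_num = abs(3.0 - 3.1)
d_far_num = abs(3.0 - 5.0)

print(d_close_onehot, d_far_onehot)
print(d_close_num, d_far_num)
```

Under the boolean encoding the ratings 3.0 and 3.1 look exactly as different as 3.0 and 5.0, while the numerical comparison preserves the fact that the first pair is nearly identical.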
Numerical features should be represented by single components of vectors
representing items. These components hold the exact value of that feature.
There is no harm if some components of the vectors are boolean and others are
² The rating is not a very reliable feature, but it will serve as an example.