Constraint             Description
---------------------  --------------------------------------------
Min Users/Tag (U/T)    Minimum number of users per tag.
Min Items/Tag (I/T)    Minimum number of items per tag.
Min Tags/User (T/U)    Minimum number of tags a user has specified.
Min Items/User (I/U)   Minimum number of items rated by a user.
Min Tags/Item (T/I)    Minimum number of tags applied to an item.
Min Users/Item (U/I)   Minimum number of users that rated an item.

Table 4.4: Quality parameters.
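The pruning that such constraints imply can be sketched as a filtering pass over (user, item, tag) triples. The function name, the thresholds, and the restriction to only two of the six constraints (Min Users/Tag and Min Tags/Item) are illustrative assumptions, not the actual procedure used in the evaluation, which may well apply all constraints and iterate until they hold simultaneously.

```python
from collections import defaultdict

def filter_assignments(assignments, min_users_per_tag=2, min_tags_per_item=2):
    """One pruning pass over (user, item, tag) tag-assignment triples.

    Illustrative sketch of two of the Table 4.4 constraints:
    a tag is kept only if enough distinct users applied it (U/T),
    and an item only if it carries enough distinct tags (T/I).
    """
    users_per_tag = defaultdict(set)
    tags_per_item = defaultdict(set)
    for user, item, tag in assignments:
        users_per_tag[tag].add(user)
        tags_per_item[item].add(tag)
    return [
        (user, item, tag)
        for user, item, tag in assignments
        if len(users_per_tag[tag]) >= min_users_per_tag
        and len(tags_per_item[item]) >= min_tags_per_item
    ]
```

A single pass like this can itself invalidate a constraint it does not check (dropping a triple lowers other counts), which is why a full implementation would typically repeat the pass until no further triples are removed.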
4.5.1 Data sets, tag quality, and data preprocessing
We used two data sets in our evaluation. First, we evaluated our methods on the “MovieLens 10M
Ratings, 100k Tags” (ML) data set 5, which was also used in the analysis by [Sen et al., 2009b]. The
data set consists of movie ratings on a 5-star scale with half-star increments. In addition, it contains
information about the tags that have been assigned by the users to the movies. A tag assignment is
a triple consisting of one user, one resource (movie) and one tag. No rating information for the tags
themselves is available in the original MovieLens database. To the best of our knowledge, the 10M
MovieLens data set is the only publicly available data set which contains both rating and tagging data.
It contains 10,000,054 ratings and 95,580 (unrated) tags applied to 10,681 movies by 71,567 users of the
online movie recommender service MovieLens.
Second, we used a new data set containing explicit tag preferences, which we collected in the user
study on the usage of tagging data for explanation purposes reported in Chapter 5. The data set contains
353 overall ratings for 100 movies provided by the 19 participants of the study. In addition to these overall
ratings, the study participants provided 5,295 explicit ratings for the tags attached to the movies. On
average, every user rated about 18 movies and each movie had 15 tags assigned.
Limited tag quality is one of the major issues when developing and evaluating approaches that operate
on the basis of user-contributed tags [Sen et al., 2007]. Therefore, different approaches to deal with the
problem of finding quality tags have been proposed in recent years, see, for example, [Gemmell et al.,
2009a], [Sen et al., 2007], or [Sen et al., 2009a].
Note that our approach of rating items by rating tags calls for a new quality requirement for tags:
tags must be appropriate for ratings. For example, there is no point in attaching a rating to a tag like
“bad movie” because the tag already represents a like/dislike statement. It would therefore not be clear
how to interpret a preference for such a tag. In our current work and evaluation, we did not take this
question into account yet, that is, we did not distinguish between tags that are appropriate for being
rated and those which are not. Still, we believe that this is one key question which was not considered
before and which should be taken into account in future approaches to extracting rating information for
tags automatically.
For the MovieLens (ML) data set, we applied and varied the constraints shown in Table 4.4 in order to
remove tags, users, or items for which not sufficient data was available. In this way, we varied the quality
of the existing tag information. For example, we only considered movies to which a minimum number of
tags was assigned (Min Tags/Item). This approach was also followed in previous work. In [Vig et al., 2009],
for example, the authors require that “a tag has been applied by at least 5 different users and to at least
2 different items”. Additionally, content analysis methods were applied to detect redundant tags such as
violent and violence, in order to replace them by one representative tag. Similar to their approach, we
further automatically pre-processed the data in three dimensions: by removing stop-words from the tags,
by applying stemming [Porter, 1997], and by filtering out noisy tags, i.e., tags in which too large a share
of the characters are not letters, numbers, or spaces, e.g., elements such as smileys.
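The three pre-processing steps can be sketched as follows. The stop-word list and the suffix-stripping rules below are deliberately tiny stand-ins for a full stop-word list and the Porter stemmer, and the noise threshold of 30% is a hypothetical choice, not the value used in the evaluation.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "and"}  # illustrative subset only

def strip_suffix(word):
    # Crude stand-in for Porter stemming: drop a few common suffixes.
    for suffix in ("ing", "ness", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def is_noisy(tag, max_noise_ratio=0.3):
    # Reject tags where too many characters are not letters,
    # digits, or spaces (e.g., smileys such as ":-)").
    noise = len(re.findall(r"[^a-zA-Z0-9 ]", tag))
    return noise / max(len(tag), 1) > max_noise_ratio

def preprocess(tags):
    cleaned = []
    for tag in tags:
        if is_noisy(tag):
            continue  # step 3: drop noisy tags
        words = [w for w in tag.lower().split() if w not in STOP_WORDS]  # step 1
        stemmed = " ".join(strip_suffix(w) for w in words)  # step 2
        if stemmed:
            cleaned.append(stemmed)
    return cleaned
```

A real implementation would use an established stemmer (e.g., NLTK's PorterStemmer) so that variants like violent and violence actually collapse onto a common stem, which the crude rules above do not guarantee.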
We created three different versions from the tag-enhanced MovieLens data with different constraints
on data density, see Table 4.5 for an overview. Note that our quality and density requirements are
relatively weak when compared, for example, with the work of [Vig et al., 2009], who required that a
tag has been used by at least five users to be considered in the evaluation. As a result, the MAE values
we report are in general slightly higher than those reported in [Sen et al., 2009b], who also used similar
5 http://www.grouplens.org/node/73