Decision Trees and Recommender Systems - Data Mining with Decision Trees: Theory and Applications


16.3.7 Clustering the Items

Searching the space of all possible pairs is reasonable only for a small set
of items. We can reduce the running time in several ways. First, if the
newcomer user $v$ has already commented on some item pairs, these pairs
should be skipped during the search. Second, we can skip any item that has
not been rated by a sufficiently large number of users in the current set of
similar users; excluding such items shortens execution time. Third, the
search is easily parallelized because each pair can be analyzed
independently. Still, since typical RSs include thousands of rated items,
going over all remaining pairs is impractical, or at least very expensive.
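The two pruning rules above can be sketched as a generator of candidate pairs. This is only an illustration under stated assumptions: the function name `candidate_pairs`, the dictionary layout of `ratings`, and the threshold parameter `min_raters` are hypothetical and do not come from the text.

```python
from itertools import combinations


def candidate_pairs(ratings, neighbors, answered, min_raters):
    """Yield item pairs that survive the two pruning rules.

    ratings:    dict user -> dict item -> rating (hypothetical layout)
    neighbors:  users currently considered similar to the newcomer
    answered:   set of frozenset({i, j}) pairs the newcomer already compared
    min_raters: minimum number of neighbors that must have rated an item
    """
    # Count how many of the similar users rated each item.
    counts = {}
    for user in neighbors:
        for item in ratings.get(user, {}):
            counts[item] = counts.get(item, 0) + 1

    # Rule 2: drop items rated by too few similar users.
    frequent = sorted(item for item, c in counts.items() if c >= min_raters)

    # Rule 1: drop pairs the newcomer has already answered.
    for i, j in combinations(frequent, 2):
        if frozenset((i, j)) not in answered:
            yield i, j
```

Because each surviving pair is scored independently, the output of such a generator could be handed to a worker pool; the number of pairs still grows quadratically with the number of retained items, which is the limitation the text points out.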

A different problem arises from the sparsity of the rating data. In many
datasets, only a few pairs can provide a sufficiently large set of similar
users for each possible comparison outcome. For the remaining pairs, the
empty bucket (i.e. $C = \emptyset$) will contain most of the users in $N(v)$.

One way to resolve these drawbacks is to cluster the items. The goal of
clustering is to group items so that intra-cluster similarities are
maximized and inter-cluster similarities are minimized. We perform the
clustering in the latent factor space. Thus, the similarity between two
items can be calculated as the cosine similarity between the factor vectors
of the two items, $q_i$ and $q_j$. By clustering the items, we define an
abstract item whose general properties are similar to those of a set of
actual items.
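The text does not name a specific clustering algorithm. As one possible illustration, the sketch below uses a k-means variant on unit-normalized factor vectors, so that the nearest centroid is the one with the highest cosine similarity. The function names, the fixed iteration cap, and the deterministic seeding with the first `k` items are assumptions made for the example.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cluster_items(item_factors, k, iterations=20):
    """Assign each item to one of k clusters by cosine similarity.

    item_factors: dict item -> latent factor vector q_i (list of floats)
    Returns a dict item -> cluster index.
    """
    items = sorted(item_factors)
    unit = {i: normalize(item_factors[i]) for i in items}
    # Assumption: seed with the first k items to keep the sketch
    # deterministic; random or k-means++ seeding is the usual choice.
    centroids = [unit[i] for i in items[:k]]
    assignment = {}
    for _ in range(iterations):
        # On unit vectors the dot product equals the cosine similarity.
        new_assignment = {
            i: max(range(k), key=lambda c: dot(unit[i], centroids[c]))
            for i in items
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        for c in range(k):
            members = [unit[i] for i in items if assignment[i] == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return assignment
```

For example, with factor vectors `{"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0], "d": [0.1, 0.9]}` and `k=2`, items `a` and `b` end up in one cluster and `c` and `d` in the other.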

Instead of searching for the best pair of items, we now search for the best
pair of clusters. We can still use Equation (16.9) for this purpose.
However, we need to handle the individual ratings of the users differently.
Because the same user can rate multiple items in the same cluster, her
corresponding pairwise score as obtained from Equation (16.8) is ambiguous.
There are many ways to aggregate all cluster-wise ratings into a single
score. Here, we implement a simple approach. We first determine, for each
cluster $l$, its centroid vector in the factorized space, $\bar{q}_l$. The
aggregated rating of user $u$ for cluster $l$ is then defined as
$r_{ul} \equiv \bar{q}_l^T p_u$. Finally, the cluster pairwise score can be
determined using Equation (16.9).
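The aggregated rating is the inner product of the cluster centroid and the user's factor vector. A minimal sketch follows, assuming the centroid is the arithmetic mean of the member items' factor vectors (the text does not say how the centroid is computed); the function names are hypothetical.

```python
def centroid(vectors):
    """Arithmetic mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cluster_rating(cluster_item_factors, user_factors):
    """Aggregated rating of a user for a cluster: centroid^T p_u.

    cluster_item_factors: list of factor vectors q_i of the cluster's items
    user_factors:         the user's factor vector p_u
    """
    q_bar = centroid(cluster_item_factors)
    return sum(q * p for q, p in zip(q_bar, user_factors))
```

With this definition, every user receives exactly one rating per cluster, including users who rated none or several of the cluster's items, so the pairwise score is well defined for every user.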

After finding the pair of clusters, we need to transform the selected pair
into a simple visual question that the user can easily answer. Because the
user has no notion of the item clusters, we propose to represent the
pairwise comparison of the two clusters $(s, t)$ by the posters of two
popular items that most differentiate between these two clusters. For this
purpose, we first sort the items in each cluster by their popularity (the
number of times each item was rated). Popular items are preferred to ensure
that the user recognizes the
