16.3.7 Clustering the Items
Searching the space of all possible pairs is reasonable only for a small set
of items. We can try to reduce running time by various means. First, if the
newcomer user v has already commented on some item pairs, then these
pairs should be skipped during the search. We can also skip any item
that has not been rated by a sufficiently large number of users in the current
set of similar users; excluding such items reduces execution
time. Moreover, the search can be easily parallelized because each pair
can be analyzed independently. Still, since typical RSs include thousands
of rated items, it is completely impractical or very expensive to go over all
remaining pairs.
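
The two pruning rules above can be sketched as a small generator. This is only an illustration: the function name `candidate_pairs`, the nested-dictionary layout of the ratings, and the `min_raters` threshold are assumptions introduced here, not part of the original method.

```python
from itertools import combinations


def candidate_pairs(ratings, similar_users, answered_pairs, min_raters):
    """Yield item pairs that are still worth evaluating for the newcomer.

    ratings: dict mapping user -> dict of item -> rating (hypothetical layout)
    similar_users: users currently considered similar to the newcomer
    answered_pairs: set of frozenset({i, j}) the newcomer already answered
    min_raters: minimum number of similar users who must have rated an item
    """
    # Count how many of the similar users rated each item.
    counts = {}
    for user in similar_users:
        for item in ratings.get(user, {}):
            counts[item] = counts.get(item, 0) + 1

    # Keep only items with enough support among the similar users.
    kept = sorted(item for item, n in counts.items() if n >= min_raters)

    # Enumerate the remaining pairs, skipping those already answered.
    for i, j in combinations(kept, 2):
        if frozenset((i, j)) not in answered_pairs:
            yield i, j


ratings = {
    "u1": {"a": 5, "b": 3, "c": 4},
    "u2": {"a": 4, "b": 2},
    "u3": {"a": 1, "c": 5, "d": 2},
}
answered = {frozenset(("a", "b"))}
print(list(candidate_pairs(ratings, ["u1", "u2", "u3"], answered, 2)))
# prints [('a', 'c'), ('b', 'c')]
```

Because each yielded pair can be scored independently, the output of such a generator could be handed to a pool of workers. Even so, the number of surviving pairs still grows quadratically with the number of kept items.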
A different problem arises from the sparsity of the rating data. In
many datasets only a few pairs can provide a sufficiently large set of similar
users for each possible comparison outcome. In the remaining pairs, the
empty bucket (i.e., C = ∅) will contain most of the users in N(v).
One way to resolve these drawbacks is to cluster the items. The goal
of clustering is to group items so that intra-cluster similarities of the items
are maximized and inter-cluster similarities are minimized. We perform the
clustering in the latent factor space. Thus, the similarity between two items
can be calculated as the cosine similarity between the vectors of the two items,
$q_i$ and $q_j$. By clustering the items, we define an abstract item that has
general properties similar to a set of actual items.
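
The text does not name a specific clustering algorithm. As one possible illustration, the sketch below runs a small spherical k-means (k-means on unit-length vectors, with cosine similarity as the assignment criterion) over the item factor vectors. The function names, the fixed iteration cap, and the choice of the first k items as seeds are all assumptions made for this example.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_items(item_factors, k, iterations=20):
    """Assign each item to one of k clusters by cosine similarity.

    item_factors: dict mapping item -> latent factor vector q_i
    Returns a dict mapping item -> cluster index.
    """
    items = sorted(item_factors)
    unit = {i: normalize(item_factors[i]) for i in items}
    # Assumption: seed the centroids with the first k items (arbitrary choice).
    centroids = [unit[i] for i in items[:k]]
    assignment = {}
    for _ in range(iterations):
        # Assignment step: each item goes to its most similar centroid.
        new_assignment = {
            i: max(range(k), key=lambda c: cosine(unit[i], centroids[c]))
            for i in items
        }
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: each centroid becomes the normalized mean of its members.
        for c in range(k):
            members = [unit[i] for i in items if assignment[i] == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return assignment


factors = {"a": [1.0, 0.1], "b": [0.9, 0.2], "c": [0.1, 1.0], "d": [0.0, 0.8]}
labels = cluster_items(factors, k=2)
print(labels["a"] == labels["b"], labels["c"] == labels["d"])
# prints True True
```

In practice a library implementation with a better seeding strategy would likely be preferable. The point here is only that similarity is measured by the angle between factor vectors rather than by their Euclidean distance.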
Instead of searching for the best pair of items, we now search
for the best pair of clusters. We can still use Equation (16.9) for this
purpose. However, we need to handle the individual ratings of the users
differently. Because the same user can rate multiple items in the same
cluster, her pairwise score as obtained from Equation (16.8)
is no longer uniquely defined. There are many ways to aggregate all cluster-wise
ratings into a single score. Here, we implement a simple approach. We first
determine for each cluster l its centroid vector in the factorized space, $q_l$;
then the aggregated rating of user u for cluster l is defined as $r_{ul} = q_l^{T} p_u$.
Finally, the cluster pairwise score can be determined using Equation (16.9).
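
The centroid-based aggregation can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the centroid is taken to be the component-wise mean of the member items' factor vectors, and the names `centroid`, `cluster_ratings`, and the dictionary layouts are hypothetical.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cluster_ratings(user_factors, item_factors, assignment):
    """Return r[u][l], the dot product of centroid q_l and user vector p_u.

    user_factors: dict mapping user -> latent vector p_u
    item_factors: dict mapping item -> latent vector q_i
    assignment:   dict mapping item -> cluster label l
    """
    # Group the item vectors by cluster label.
    members = {}
    for item, label in assignment.items():
        members.setdefault(label, []).append(item_factors[item])

    # One centroid q_l per cluster (assumed here to be the arithmetic mean).
    centroids = {label: centroid(vs) for label, vs in members.items()}

    # Aggregated rating of every user for every cluster.
    return {
        u: {
            label: sum(q * p for q, p in zip(q_l, p_u))
            for label, q_l in centroids.items()
        }
        for u, p_u in user_factors.items()
    }


item_factors = {"a": [1.0, 0.0], "b": [3.0, 0.0], "c": [0.0, 2.0]}
assignment = {"a": 0, "b": 0, "c": 1}
user_factors = {"u": [2.0, 1.0]}
print(cluster_ratings(user_factors, item_factors, assignment))
# prints {'u': {0: 4.0, 1: 2.0}}
```

With one aggregated rating per user and cluster, the cluster-level scores can be fed into the same pairwise comparison machinery that was used for individual items.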
After finding the pair of clusters, we need to transform the selected pair
into a simple visual question that the user can easily answer. Because the
user has no notion of the item clusters, we propose to represent the pairwise
comparison of the two clusters (s, t) by the posters of two popular items
that most differentiate between these two clusters. For this purpose, we first
sort the items in each cluster by their popularity (the number of times each item was
rated). Popular items are preferred to ensure that the user recognizes the