Building a Recommendation Engine with Spark - Machine Learning with Spark


Computing similarity with item-factor vectors

The benefit of factorization models is the relative ease of computing recommendations once the model is created. Another advantage, as mentioned earlier, is that they tend to offer very good performance. However, for very large user and item sets, serving recommendations can become a challenge, as it requires storage and computation across potentially many millions of user- and item-factor vectors.
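
To illustrate why computing recommendations is straightforward once the factors exist, here is a minimal sketch in plain Python. It assumes the predicted score for a user and an item is the dot product of their factor vectors, which is the usual formulation for matrix factorization. The function names, item IDs, and factor values are invented for illustration and do not come from any library.

```python
def dot(u, v):
    """Dot product of two equal-length factor vectors."""
    if len(u) != len(v):
        raise ValueError("factor vectors must have the same length")
    return sum(a * b for a, b in zip(u, v))


def recommend(user_vector, item_factors, k):
    """Return the k highest-scoring (item_id, score) pairs for one user."""
    scores = [(item_id, dot(user_vector, vec))
              for item_id, vec in item_factors.items()]
    # Sort by predicted score, highest first.
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:k]


# Hypothetical 2-factor model; all values are made up for this example.
user_vector = [1.0, 0.5]
item_factors = {
    "movie_a": [2.0, 0.0],
    "movie_b": [0.0, 2.0],
    "movie_c": [1.0, 3.0],
}

top_two = recommend(user_vector, item_factors, k=2)
```

Note that this scores every item for a single user. With millions of users and items, that full scan is exactly the storage and computation cost described above.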

Note

Projects such as Oryx (https://github.com/OryxProject/oryx) and Prediction.io (https://github.com/PredictionIO/PredictionIO) focus on model serving for large-scale models, including recommenders based on matrix factorization.

On the downside, factorization models are more complex to understand and interpret than nearest-neighbor models, and they are often more computationally intensive during the model's training phase.

Implicit matrix factorization

So far, we have dealt with explicit preferences such as ratings. However, much of the preference data that we might be able to collect is implicit feedback, where a user's preference for an item is not given to us directly but is instead implied by the interactions they have with that item. Examples include binary data (such as whether a user viewed a movie, whether they purchased a product, and so on) as well as count data (such as the number of times a user watched a movie).

There are many different approaches to dealing with implicit data. MLlib implements a particular approach that treats the input rating matrix as two matrices: a binary preference matrix, P, and a matrix of confidence weights, C.

For example, let's assume that the user-movie ratings we saw previously were, in fact, the number of times each user had viewed that movie. The two matrices would look something like the ones shown in the following screenshot. Here, the matrix P indicates whether a movie was viewed by a user, and the matrix C represents the confidence weighting, in the form of the view counts. Generally, the more a user has watched a movie, the higher the confidence that they actually like it.
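
The split into P and C can be sketched in a few lines of plain Python. This is only an illustration of the idea described above, not MLlib's implementation: the function name and the view counts are invented, and the confidence matrix simply reuses the raw counts, as in the description.

```python
def split_implicit(counts):
    """Split a user-by-movie matrix of view counts into (P, C).

    P is the binary preference matrix: 1 if the user viewed the movie
    at least once, 0 otherwise.
    C is the confidence matrix: here, simply a copy of the raw counts.
    """
    P = [[1 if c > 0 else 0 for c in row] for row in counts]
    C = [list(row) for row in counts]
    return P, C


# Hypothetical view counts: rows are users, columns are movies.
view_counts = [
    [2, 0, 5],
    [0, 1, 0],
    [3, 3, 0],
]

P, C = split_implicit(view_counts)
# P == [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
# C == [[2, 0, 5], [0, 1, 0], [3, 3, 0]]
```

In practice, implementations often rescale the raw counts before using them as confidence weights. MLlib's implicit ALS, for instance, exposes an alpha parameter for this purpose. That scaling is deliberately left out of this sketch.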
