to each other (because the same users interact with them) and which users are similar
to each other, and can make new recommendations.
While the MLlib API talks about “users” and “products,” you can also use collabora‐
tive filtering for other applications, such as recommending users to follow on a social
network, tags to add to an article, or songs to add to a radio station.
Alternating Least Squares
MLlib includes an implementation of Alternating Least Squares (ALS), a popular algorithm for collaborative filtering that scales well on clusters.6 It is located in the mllib.recommendation.ALS class.
ALS works by determining a feature vector for each user and product, such that the dot product of a user's vector and a product's vector is close to their rating. It takes the following parameters:
rank
Size of feature vectors to use; larger ranks can lead to better models but are more expensive to compute (default: 10).
iterations
Number of iterations to run (default: 10).
lambda
Regularization parameter (default: 0.01).
alpha
A constant used for computing confidence in implicit ALS (default: 1.0).
numUserBlocks, numProductBlocks
Number of blocks to divide user and product data into, to control parallelism; you can pass -1 to let MLlib automatically determine this (the default behavior).
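To make the alternating-update idea concrete, here is a minimal rank-1 ALS sketch in plain Python. This is an illustration only, not Spark's implementation: with one factor per user and per product, the closed-form regularized least-squares update for each side has a simple numerator/denominator form, and the predicted products u[i] * p[j] converge toward the observed ratings.

```python
# Illustrative rank-1 ALS in plain Python (the real, scalable version is
# Spark's mllib.recommendation.ALS; this only shows the math).
ratings = {  # (user, product) -> rating; deliberately a rank-1 matrix
    (0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0,
    (1, 0): 2.0, (1, 1): 4.0, (1, 2): 6.0,
}
n_users, n_products = 2, 3
lam = 0.01               # regularization parameter ("lambda" in MLlib)
u = [1.0] * n_users      # one factor per user (rank = 1)
p = [1.0] * n_products   # one factor per product

for _ in range(10):      # "iterations" parameter
    # Fix product factors, solve each user factor in closed form
    for i in range(n_users):
        num = sum(r * p[j] for (ui, j), r in ratings.items() if ui == i)
        den = sum(p[j] ** 2 for (ui, j), _ in ratings.items() if ui == i) + lam
        u[i] = num / den
    # Fix user factors, solve each product factor in closed form
    for j in range(n_products):
        num = sum(r * u[i] for (i, pj), r in ratings.items() if pj == j)
        den = sum(u[i] ** 2 for (i, pj), _ in ratings.items() if pj == j) + lam
        p[j] = num / den

# Predicted rating for user 0 on product 2: close to the observed 3.0
pred = u[0] * p[2]
```

At higher ranks the per-user and per-product updates become small regularized linear solves rather than scalar divisions, but the alternating structure is the same, which is what lets the algorithm parallelize across users and products.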
To use ALS, you need to give it an RDD of mllib.recommendation.Rating objects, each of which contains a user ID, a product ID, and a rating (either an explicit rating or implicit feedback; see upcoming discussion). One challenge with the implementation is that each ID needs to be a 32-bit integer. If your IDs are strings or larger numbers, it is recommended to use the hash code of each ID in ALS; even if two users or products map to the same ID, overall results can still be good. Alternatively, you can broadcast() a table of product-ID-to-integer mappings to give them unique IDs.
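As a sketch of the hashing approach, the hypothetical helper below maps string IDs deterministically to nonnegative 31-bit integers, so they fit in the signed 32-bit range ALS requires. The name `to_int_id` and the use of CRC32 are assumptions for illustration, not part of MLlib; occasional collisions are accepted, as noted above.

```python
import zlib

def to_int_id(s):
    """Deterministically map a string ID to a nonnegative 31-bit integer.

    Fits ALS's 32-bit integer requirement; unlike Python's built-in
    hash(), CRC32 is stable across processes, so the same string always
    maps to the same ID when building Rating objects on different workers.
    """
    return zlib.crc32(s.encode("utf-8")) & 0x7FFFFFFF
```

You would then construct each Rating as, for example, Rating(to_int_id(user), to_int_id(product), score) before passing the RDD to ALS.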
6 Two research papers on ALS for web-scale data are Zhou et al.'s “Large-Scale Parallel Collaborative Filtering
for the Netflix Prize” and Hu et al.'s “Collaborative Filtering for Implicit Feedback Datasets,” both from 2008.