to each other (because the same users interact with them) and which users are similar
to each other, and can make new recommendations.
While the MLlib API talks about “users” and “products,” you can also use collabora‐
tive filtering for other applications, such as recommending users to follow on a social
network, tags to add to an article, or songs to add to a radio station.
Alternating Least Squares
MLlib includes an implementation of Alternating Least Squares (ALS), a popular algorithm for collaborative filtering that scales well on clusters.6 It is located in the mllib.recommendation.ALS class.
ALS works by determining a feature vector for each user and product, such that the dot product of a user's vector and a product's vector is close to their rating. It takes the following parameters:
rank
Size of feature vectors to use; larger ranks can lead to better models but are more expensive to compute (default: 10).
iterations
Number of iterations to run (default: 10).
lambda
Regularization parameter (default: 0.01).
alpha
A constant used for computing confidence in implicit ALS (default: 1.0).
numUserBlocks, numProductBlocks
Number of blocks to divide user and product data into, to control parallelism; you can pass -1 to let MLlib automatically determine this (the default behavior).
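To make the alternating-update idea concrete, here is a minimal rank-1 ALS sketch in plain Python. This is an illustration only, not Spark's implementation: with one factor per user and per product, the closed-form regularized least-squares update for each side has a simple numerator/denominator form, and the predicted products u[i] * p[j] converge toward the observed ratings.

```python
# Illustrative rank-1 ALS in plain Python (the real, scalable version is
# Spark's mllib.recommendation.ALS; this only shows the math).
ratings = {  # (user, product) -> rating; deliberately a rank-1 matrix
    (0, 0): 1.0, (0, 1): 2.0, (0, 2): 3.0,
    (1, 0): 2.0, (1, 1): 4.0, (1, 2): 6.0,
}
n_users, n_products = 2, 3
lam = 0.01               # regularization parameter ("lambda" in MLlib)
u = [1.0] * n_users      # one factor per user (rank = 1)
p = [1.0] * n_products   # one factor per product

for _ in range(10):      # "iterations" parameter
    # Fix product factors, solve each user factor in closed form
    for i in range(n_users):
        num = sum(r * p[j] for (ui, j), r in ratings.items() if ui == i)
        den = sum(p[j] ** 2 for (ui, j), _ in ratings.items() if ui == i) + lam
        u[i] = num / den
    # Fix user factors, solve each product factor in closed form
    for j in range(n_products):
        num = sum(r * u[i] for (i, pj), r in ratings.items() if pj == j)
        den = sum(u[i] ** 2 for (i, pj), _ in ratings.items() if pj == j) + lam
        p[j] = num / den

# Predicted rating for user 0 on product 2: close to the observed 3.0
pred = u[0] * p[2]
```

At higher ranks the per-user and per-product updates become small regularized linear solves rather than scalar divisions, but the alternating structure is the same, which is what lets the algorithm parallelize across users and products.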
To use ALS, you need to give it an RDD of mllib.recommendation.Rating objects, each of which contains a user ID, a product ID, and a rating (either an explicit rating or implicit feedback; see upcoming discussion). One challenge with the implementation is that each ID needs to be a 32-bit integer. If your IDs are strings or larger numbers, it is recommended to use the hash code of each ID in ALS; even if two users or products map to the same ID, overall results can still be good. Alternatively, you can broadcast() a table of product-ID-to-integer mappings to give them unique IDs.
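As a sketch of the hashing approach, the hypothetical helper below maps string IDs deterministically to nonnegative 31-bit integers, so they fit in the signed 32-bit range ALS requires. The name `to_int_id` and the use of CRC32 are assumptions for illustration, not part of MLlib; occasional collisions are accepted, as noted above.

```python
import zlib

def to_int_id(s):
    """Deterministically map a string ID to a nonnegative 31-bit integer.

    Fits ALS's 32-bit integer requirement; unlike Python's built-in
    hash(), CRC32 is stable across processes, so the same string always
    maps to the same ID when building Rating objects on different workers.
    """
    return zlib.crc32(s.encode("utf-8")) & 0x7FFFFFFF
```

You would then construct each Rating as, for example, Rating(to_int_id(user), to_int_id(product), score) before passing the RDD to ALS.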
6 Two research papers on ALS for web-scale data are Zhou et al.'s “Large-Scale Parallel Collaborative Filtering
for the Netflix Prize” and Hu et al.'s “Collaborative Filtering for Implicit Feedback Datasets,” both from 2008.