ALS returns a MatrixFactorizationModel representing its results, which can be used
to predict() ratings for an RDD of (userID, productID) pairs.7 Alternatively, you
can use model.recommendProducts(userId, numProducts) to find the top numProducts
products recommended for a given user. Note that unlike other models in
MLlib, the MatrixFactorizationModel is large, holding one vector for each user and
product. This means that it cannot be saved to disk and then loaded back in another
run of your program. Instead, you can save the RDDs of feature vectors it produces,
model.userFeatures and model.productFeatures, to a distributed filesystem.
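For concreteness, here is a minimal Scala sketch of this workflow; the input RDD
ratings and the output paths are hypothetical placeholders:

    import org.apache.spark.mllib.recommendation.{ALS, Rating}

    // Assume `ratings` is an RDD[Rating] of (user, product, rating) triples.
    val model = ALS.train(ratings, 10, 10)  // rank = 10, iterations = 10

    // Predict ratings for an RDD of (userID, productID) pairs.
    val userProducts = ratings.map(r => (r.user, r.product))
    val predictions = model.predict(userProducts)

    // Top 5 products recommended for user 42.
    val top5 = model.recommendProducts(42, 5)

    // The model itself cannot be saved and reloaded, but its feature
    // vectors can be written to a distributed filesystem.
    model.userFeatures.saveAsObjectFile("hdfs://.../userFeatures")
    model.productFeatures.saveAsObjectFile("hdfs://.../productFeatures")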
Finally, there are two variants of ALS: for explicit ratings (the default) and for
implicit ratings (which you enable by calling ALS.trainImplicit() instead of
ALS.train()). With explicit ratings, each user's rating for a product needs to be a
score (e.g., 1 to 5 stars), and the predicted ratings will be scores. With implicit
feedback, each rating represents a confidence that users will interact with a given
item (e.g., the rating might go up the more times a user visits a web page), and the
predicted items will be confidence values. Further details about ALS with implicit
ratings are described in Hu et al., "Collaborative Filtering for Implicit Feedback
Datasets," ICDM 2008.
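The two entry points differ only in how the rating field is interpreted; a brief
sketch, where the input RDDs explicitRatings and implicitRatings are assumed:

    import org.apache.spark.mllib.recommendation.ALS

    // Explicit ratings: each Rating's score is an actual rating (e.g., 1.0 to 5.0 stars).
    val explicitModel = ALS.train(explicitRatings, 10, 10)

    // Implicit feedback: each Rating's score is a confidence weight
    // (e.g., the number of times the user visited a page).
    val implicitModel = ALS.trainImplicit(implicitRatings, 10, 10)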
Dimensionality Reduction
Principal component analysis
Given a dataset of points in a high-dimensional space, we are often interested in
reducing the dimensionality of the points so that they can be analyzed with simpler
tools. For example, we might want to plot the points in two dimensions, or just
reduce the number of features to train models more effectively.
The main technique for dimensionality reduction used by the machine learning
community is principal component analysis (PCA). In this technique, the mapping to
the lower-dimensional space is done such that the variance of the data in the
low-dimensional representation is maximized, thus ignoring noninformative
dimensions. To compute the mapping, the normalized correlation matrix of the data
is constructed and the singular vectors and values of this matrix are used. The
singular vectors that correspond to the largest singular values are used to
reconstruct a large fraction of the variance of the original data.
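In the standard formulation (a sketch of the textbook derivation, not necessarily
MLlib's exact computation), for a data matrix X with n rows (points) whose columns
have been centered and normalized:

    \[
      C = \frac{1}{n} X^{\top} X = V \Lambda V^{\top},
      \qquad \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_d,
    \]
    \[
      Y = X V_k \in \mathbb{R}^{n \times k},
    \]

where V_k holds the first k columns of V (the singular vectors with the largest
singular values) and the projection Y retains a fraction
(\lambda_1 + \cdots + \lambda_k) / (\lambda_1 + \cdots + \lambda_d) of the total
variance.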
PCA is currently available only in Java and Scala (as of MLlib 1.2). To invoke it, you
must first represent your matrix using the mllib.linalg.distributed.RowMatrix
class, which stores an RDD of Vectors, one per row.
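A minimal Scala sketch of the invocation, where the input points are made up for
illustration and sc is assumed to be an existing SparkContext:

    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.linalg.distributed.RowMatrix

    // Build a RowMatrix from an RDD[Vector], one vector per data point.
    val points = sc.parallelize(Seq(
      Vectors.dense(1.0, 2.0, 3.0),
      Vectors.dense(4.0, 5.0, 6.0),
      Vectors.dense(7.0, 8.0, 10.0)))
    val mat = new RowMatrix(points)

    // Compute the top 2 principal components, returned as a local Matrix.
    val pc = mat.computePrincipalComponents(2)

    // Project the rows into the 2-dimensional principal component space.
    val projected = mat.multiply(pc)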
7. In Java, start with a JavaRDD of Tuple2<Integer, Integer> and call .rdd() on it.
 