Building a Recommendation Engine with Spark - Machine Learning with Spark

Training a model on the MovieLens 100k dataset

We're now ready to train our model! The other inputs required for our model are as follows:

• rank: This refers to the number of factors in our ALS model, that is, the number of hidden features in our low-rank approximation matrices. More factors generally let the model fit the ratings more closely, but the rank has a direct impact on memory usage, both for computation and for storing models for serving, particularly with a large number of users or items. Hence, this is often a trade-off in real-world use cases. A rank in the range of 10 to 200 is usually reasonable.

• iterations: This refers to the number of iterations to run. Each iteration in ALS is guaranteed not to increase the regularized reconstruction error of the ratings matrix, and ALS models typically converge to a reasonably good solution after relatively few iterations. So, we don't need to run for too many iterations in most cases (around 10 is often a good default).

• lambda: This parameter controls the regularization of our model and thus helps control overfitting. The higher the value of lambda, the more regularization is applied. What constitutes a sensible value depends heavily on the size, nature, and sparsity of the underlying data, and as with almost all machine learning models, the regularization parameter should be tuned using out-of-sample test data and cross-validation approaches.
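
To make the memory trade-off for rank concrete, here is a rough back-of-the-envelope sketch in plain Scala (no Spark required). The FactorMemory object and its bytes method are hypothetical helpers introduced only for illustration. The estimate assumes one 8-byte Double per factor and ignores JVM object and array overhead, so real usage will be higher. The MovieLens 100k figures (943 users, 1,682 movies) are the dataset's published counts; the 10M-user service is invented.

```scala
// Rough estimate of the memory needed to hold the ALS factor matrices.
// Assumption: one 8-byte Double per factor; JVM object and array
// overhead is ignored, so this is a lower bound.
object FactorMemory {
  /** Bytes needed for the raw user and item factor values. */
  def bytes(numUsers: Long, numItems: Long, rank: Int): Long =
    (numUsers + numItems) * rank * 8L

  def main(args: Array[String]): Unit = {
    // MovieLens 100k: 943 users and 1,682 movies.
    for (rank <- Seq(10, 50, 200)) {
      val mb = bytes(943L, 1682L, rank) / 1e6
      println(f"MovieLens 100k, rank=$rank%3d: $mb%.2f MB")
    }
    // A hypothetical service with 10M users and 1M items.
    for (rank <- Seq(10, 50, 200)) {
      val gb = bytes(10000000L, 1000000L, rank) / 1e9
      println(f"10M users / 1M items, rank=$rank%3d: $gb%.2f GB")
    }
  }
}
```

For a dataset as small as MovieLens 100k the factors are tiny at any rank in the suggested range; the cost only becomes a real constraint at much larger user and item counts, where it grows linearly with the rank.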

We'll use a rank of 50, 10 iterations, and a lambda parameter of 0.01 to illustrate how to train our model:

val model = ALS.train(ratings, 50, 10, 0.01)

This returns a MatrixFactorizationModel object, which contains the user and item factors, each in the form of an RDD of (id, factor) pairs. These are called userFeatures and productFeatures, respectively. For example:

model.userFeatures

You will see the output as:

res14: org.apache.spark.rdd.RDD[(Int, Array[Double])] =

FlatMappedRDD[659] at flatMap at ALS.scala:231

We can see that the factors are in the form of an Array[Double].
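
To show how these factor arrays are used, the following minimal sketch computes a predicted rating as the dot product of one user's factor vector and one item's factor vector, which is the score a matrix factorization model assigns to a (user, item) pair. It is plain Scala rather than Spark code: the FactorDot object, its predict method, and the rank-3 example vectors are hypothetical and invented for illustration. With a trained model, each array would be the second element of one (id, factor) pair from userFeatures or productFeatures.

```scala
// Predicted rating as the dot product of a user factor vector and an
// item factor vector. Plain Scala sketch; not part of the MLlib API.
object FactorDot {
  def predict(userFactors: Array[Double], itemFactors: Array[Double]): Double = {
    // Both vectors must come from the same model, so they share one rank.
    require(userFactors.length == itemFactors.length,
      "factor vectors must have the same rank")
    userFactors.zip(itemFactors).map { case (u, i) => u * i }.sum
  }

  def main(args: Array[String]): Unit = {
    // Hypothetical rank-3 factors, made up for illustration only.
    val user = Array(1.0, 0.5, -2.0)
    val item = Array(2.0, 4.0, 0.5)
    println(predict(user, item))
  }
}
```

With a rank of 50, as in the model trained above, each of the two arrays would hold 50 values instead of 3.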
