Training a clustering model on the MovieLens dataset
We will train a model for both the movie and user factors that we generated by running our recommendation model. We need to pass in the number of clusters K and the maximum number of iterations for the algorithm to run. Model training might complete in fewer than the maximum number of iterations if the change in the objective function from one iteration to the next falls below the tolerance level (the default tolerance is 0.0001).
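These parameters map onto MLlib's builder-style KMeans API. As a minimal sketch (not part of the original walkthrough), the snippet below constructs a configured KMeans instance; the commented-out run call refers to the factor vectors we prepare later in this section, and the availability of the setEpsilon setter can vary between MLlib versions:
import org.apache.spark.mllib.clustering.KMeans
// Sketch of the builder-style configuration described above
val configuredKMeans = new KMeans()
  .setK(5)              // number of clusters K
  .setMaxIterations(10) // maximum number of iterations per training run
// Depending on the MLlib version, the convergence tolerance can also be set
// explicitly, for example with .setEpsilon(1e-4); otherwise the 0.0001
// default applies.
// val model = configuredKMeans.run(movieVectors)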
MLlib's K-means provides random and K-means|| initialization, with the default being K-means||. As both of these initialization methods are based on random selection to some extent, each model training run will return a different result.
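To make the initialization choice explicit, one of KMeans.train's overloads accepts the initialization mode as a string, and MLlib provides the KMeans.RANDOM and KMeans.K_MEANS_PARALLEL constants for it. The following sketch uses a small toy RDD purely for illustration; in this section we cluster the factor vectors instead:
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors
// Toy data for illustration only
val toyVectors = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(1.0, 1.0), Vectors.dense(9.0, 8.0)))
// Explicitly request random initialization instead of the default K-means||
// (KMeans.K_MEANS_PARALLEL); the arguments are data, k, maxIterations,
// runs, and initializationMode.
val randomInitModel = KMeans.train(toyVectors, 2, 10, 1, KMeans.RANDOM)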
K-means generally does not converge to a globally optimal model, so it is common practice to perform multiple training runs and select the best model from those runs. MLlib's training methods expose an option to complete multiple model training runs; the best run, as measured by the loss function, is selected as the final model.
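To illustrate what this selection amounts to, the following sketch (not taken from the original example) trains a few models by hand and keeps the one with the lowest K-means cost, as reported by KMeansModel's computeCost method, on the movie factor vectors that we cluster below:
import org.apache.spark.mllib.clustering.KMeans
// Sketch: pick the best of several independent runs by K-means cost
// (the within-cluster sum of squared errors)
val candidateModels = (1 to 3).map(_ => KMeans.train(movieVectors, 5, 10))
val bestManualModel = candidateModels.minBy(_.computeCost(movieVectors))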
We will first set up the required imports, as well as the model parameters: K, maximum iterations, and number of runs:
import org.apache.spark.mllib.clustering.KMeans
val numClusters = 5    // number of clusters K
val numIterations = 10 // maximum number of iterations per training run
val numRuns = 3        // number of training runs
We will then run K-means on the movie factor vectors:
val movieClusterModel = KMeans.train(movieVectors,
numClusters, numIterations, numRuns)
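Since we want a clustering model for the user factors as well, the same call applies to the user factor vectors. In the sketch below, userVectors is assumed to be an RDD prepared in the same way as movieVectors:
// userVectors is assumed to hold the user factor vectors
val userClusterModel = KMeans.train(userVectors, numClusters, numIterations, numRuns)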
Once the model has completed training, we should see output that looks something like
this:
...
14/09/02 21:53:58 INFO SparkContext: Job finished:
collectAsMap at KMeans.scala:193, took 0.02043 s
14/09/02 21:53:58 INFO KMeans: Iterations took 0.331 seconds.
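As a quick sanity check (a sketch, not part of the original listing), we can inspect the trained model's cluster centers and compute its cost on the training data through the KMeansModel API; the exact values will differ between runs:
// Inspect the trained movie clustering model
println(movieClusterModel.k)                          // number of clusters (5)
println(movieClusterModel.clusterCenters.head)        // center of the first cluster
println(movieClusterModel.computeCost(movieVectors))  // within-cluster sum of squared errors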