import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
// Load the raw ratings and keep the first three fields:
// user ID, movie ID, and rating
val rawData = sc.textFile("/PATH/ml-100k/u.data")
val rawRatings = rawData.map(_.split("\t").take(3))
val ratings = rawRatings.map { case Array(user, movie, rating) =>
  Rating(user.toInt, movie.toInt, rating.toDouble)
}
ratings.cache
// Train the ALS model with rank 50, 10 iterations, and a
// regularization parameter of 0.1
val alsModel = ALS.train(ratings, 50, 10, 0.1)
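As a quick sanity check (a sketch added here, not part of the original listing), we can confirm that the model produced one factor vector per distinct user and movie; for the MovieLens 100k dataset, these counts should be 943 and 1,682, respectively:
// Each factor RDD should contain one entry per distinct user or
// movie ID seen during training (943 users and 1682 movies in ml-100k)
println(alsModel.userFeatures.count)
println(alsModel.productFeatures.count)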
Recall from Chapter 4, Building a Recommendation Engine with Spark, that the trained ALS model contains the factors in two RDDs of key-value pairs (called userFeatures and productFeatures) with the user or movie ID as the key and the factor as the value. We will need to extract just the factors and transform each of them into an MLlib Vector to use as training input for our clustering model.
We will do this for both users and movies as follows:
import org.apache.spark.mllib.linalg.Vectors
// Convert each factor array into an MLlib Vector, then keep only
// the vectors (dropping the IDs) as input for clustering
val movieFactors = alsModel.productFeatures.map { case (id, factor) =>
  (id, Vectors.dense(factor)) }
val movieVectors = movieFactors.map(_._2)
val userFactors = alsModel.userFeatures.map { case (id, factor) =>
  (id, Vectors.dense(factor)) }
val userVectors = userFactors.map(_._2)
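As a small check (again, a sketch rather than part of the original code), each extracted vector should have a dimension equal to the ALS rank of 50 that we chose above:
// The size of each factor vector should match the ALS rank (50)
println(movieVectors.first.size)
println(userVectors.first.size)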
Normalization
Before we train our clustering model, it might be useful to look at the distribution of the input data in the form of the factor vectors. This will tell us whether we need to normalize the training data.
We will follow the same approach as we did in Chapter 5, Building a Classification Model with Spark, using MLlib's summary statistics available in the distributed RowMatrix class:
import org.apache.spark.mllib.linalg.distributed.RowMatrix
// Wrap the movie factor vectors in a RowMatrix and compute
// column-wise summary statistics
val movieMatrix = new RowMatrix(movieVectors)
val movieMatrixSummary = movieMatrix.computeColumnSummaryStatistics()
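To inspect the distribution, we can print a few of the fields of the returned MultivariateStatisticalSummary and compute the same statistics for the user vectors; the sketch below assumes the variables defined above:
// Column means and variances of the movie factors
println("Movie factors mean: " + movieMatrixSummary.mean)
println("Movie factors variance: " + movieMatrixSummary.variance)
// The same statistics for the user factor vectors
val userMatrix = new RowMatrix(userVectors)
val userMatrixSummary = userMatrix.computeColumnSummaryStatistics()
println("User factors mean: " + userMatrixSummary.mean)
println("User factors variance: " + userMatrixSummary.variance)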