import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

// Load the MovieLens 100k ratings and keep the first three fields:
// user ID, movie ID, and rating
val rawData = sc.textFile("/PATH/ml-100k/u.data")
val rawRatings = rawData.map(_.split("\t").take(3))
val ratings = rawRatings.map { case Array(user, movie, rating) =>
  Rating(user.toInt, movie.toInt, rating.toDouble)
}
ratings.cache
// Train an ALS model with rank 50, 10 iterations, and regularization 0.1
val alsModel = ALS.train(ratings, 50, 10, 0.1)
The model returned contains the factors in two RDDs of key-value pairs (called userFeatures and productFeatures), with the user or movie ID as the key and the factor as the value. We will need to extract just the factors and transform each of them into an MLlib Vector to use as training input for our clustering model.
We will do this for both users and movies as follows:
import org.apache.spark.mllib.linalg.Vectors

// Convert each factor array into an MLlib Vector, keeping the ID as the key
val movieFactors = alsModel.productFeatures.map { case (id, factor) =>
  (id, Vectors.dense(factor))
}
val movieVectors = movieFactors.map(_._2)
val userFactors = alsModel.userFeatures.map { case (id, factor) =>
  (id, Vectors.dense(factor))
}
val userVectors = userFactors.map(_._2)
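As a quick sanity check, each factor vector should have a dimension equal to the ALS rank we chose when training the model (50 above). A minimal sketch, assuming the movieVectors RDD from the preceding code:

```scala
// first() runs a small job that returns a single vector to the driver.
// Its size should equal the ALS rank passed to ALS.train (50 above).
val firstVector = movieVectors.first()
println(s"Factor vector dimension: ${firstVector.size}")
```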
Normalization
Before we train our clustering model, it is useful to inspect the distribution of the input data, that is, the factor vectors. This will tell us whether we need to normalize the training data.
We will follow the same approach as we did in Chapter 5, Building a Classification Model with Spark, using MLlib's summary statistics available in the distributed RowMatrix class:
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val movieMatrix = new RowMatrix(movieVectors)
val movieMatrixSummary = movieMatrix.computeColumnSummaryStatistics()
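With the column statistics computed, we can inspect the per-dimension mean and variance of the factors; if the variances differed widely across dimensions, MLlib's StandardScaler could be used to standardize the vectors before clustering. A sketch of both steps, assuming the movieVectors and userVectors RDDs defined earlier (the same statistics can be computed for the user factors):

```scala
// Print the per-column mean and variance of the movie factors
println("Movie factors mean: " + movieMatrixSummary.mean)
println("Movie factors variance: " + movieMatrixSummary.variance)

// The same statistics for the user factors
val userMatrix = new RowMatrix(userVectors)
val userMatrixSummary = userMatrix.computeColumnSummaryStatistics()
println("User factors mean: " + userMatrixSummary.mean)
println("User factors variance: " + userMatrixSummary.variance)

// If normalization were needed, StandardScaler can rescale each column to
// zero mean and unit variance (withMean = true requires dense vectors)
import org.apache.spark.mllib.feature.StandardScaler
val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(movieVectors)
val scaledMovieVectors = scaler.transform(movieVectors)
```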