import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

// Load the MovieLens 100k ratings and keep the first three fields:
// user ID, movie ID, and rating
val rawData = sc.textFile("/PATH/ml-100k/u.data")
val rawRatings = rawData.map(_.split("\t").take(3))
val ratings = rawRatings.map { case Array(user, movie, rating) =>
  Rating(user.toInt, movie.toInt, rating.toDouble)
}
ratings.cache
// Train an ALS model with rank 50, 10 iterations, and regularization 0.1
val alsModel = ALS.train(ratings, 50, 10, 0.1)
The model returned contains the factors in two RDDs of key-value pairs (called userFeatures and productFeatures), with the user or movie ID as the key and the factor as the value. We will need to extract just the factors and transform each of them into an MLlib Vector to use as training input for our clustering model.
We will do this for both users and movies as follows:
import org.apache.spark.mllib.linalg.Vectors

// Convert each factor array into an MLlib Vector, keeping the ID as the key
val movieFactors = alsModel.productFeatures.map { case (id, factor) =>
  (id, Vectors.dense(factor))
}
val movieVectors = movieFactors.map(_._2)
val userFactors = alsModel.userFeatures.map { case (id, factor) =>
  (id, Vectors.dense(factor))
}
val userVectors = userFactors.map(_._2)
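As a quick sanity check, each factor vector should have a dimension equal to the ALS rank we chose when training the model (50 above). A minimal sketch, assuming the movieVectors RDD from the preceding code:

```scala
// first() runs a small job that returns a single vector to the driver.
// Its size should equal the ALS rank passed to ALS.train (50 above).
val firstVector = movieVectors.first()
println(s"Factor vector dimension: ${firstVector.size}")
```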
Normalization
Before we train our clustering model, it is useful to inspect the distribution of the input data, that is, the factor vectors. This will tell us whether we need to normalize the training data.
We will follow the same approach as we did in Chapter 5, Building a Classification Model with Spark, using MLlib's summary statistics available in the distributed RowMatrix class:
import org.apache.spark.mllib.linalg.distributed.RowMatrix
val movieMatrix = new RowMatrix(movieVectors)
val movieMatrixSummary = movieMatrix.computeColumnSummaryStatistics()
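With the column statistics computed, we can inspect the per-dimension mean and variance of the factors; if the variances differed widely across dimensions, MLlib's StandardScaler could be used to standardize the vectors before clustering. A sketch of both steps, assuming the movieVectors and userVectors RDDs defined earlier (the same statistics can be computed for the user factors):

```scala
// Print the per-column mean and variance of the movie factors
println("Movie factors mean: " + movieMatrixSummary.mean)
println("Movie factors variance: " + movieMatrixSummary.variance)

// The same statistics for the user factors
val userMatrix = new RowMatrix(userVectors)
val userMatrixSummary = userMatrix.computeColumnSummaryStatistics()
println("User factors mean: " + userMatrixSummary.mean)
println("User factors variance: " + userMatrixSummary.variance)

// If normalization were needed, StandardScaler can rescale each column to
// zero mean and unit variance (withMean = true requires dense vectors)
import org.apache.spark.mllib.feature.StandardScaler
val scaler = new StandardScaler(withMean = true, withStd = true)
  .fit(movieVectors)
val scaledMovieVectors = scaler.transform(movieVectors)
```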