import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SQLContext

// A class to represent documents -- will be turned into a SchemaRDD
case class LabeledDocument(id: Long, text: String, label: Double)

val documents = // (load RDD of LabeledDocument)

val sqlContext = new SQLContext(sc)
import sqlContext._

// Configure an ML pipeline with three stages: tokenizer, tf, and lr; each stage
// outputs a column in a SchemaRDD and feeds it to the next stage's input column
val tokenizer = new Tokenizer()    // Splits each email into words
  .setInputCol("text")
  .setOutputCol("words")
val tf = new HashingTF()           // Maps email words to vectors of 10000 features
  .setNumFeatures(10000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression() // Uses "features" as inputCol by default
val pipeline = new Pipeline().setStages(Array(tokenizer, tf, lr))

// Fit the pipeline to the training documents
val model = pipeline.fit(documents)

// Alternatively, instead of fitting once with the parameters above, we can do a
// grid search over some parameters and pick the best model via cross-validation
val paramMaps = new ParamGridBuilder()
  .addGrid(tf.numFeatures, Array(10000, 20000))
  .addGrid(lr.maxIter, Array(100, 200))
  .build()                         // Builds all combinations of parameters
val eval = new BinaryClassificationEvaluator()
val cv = new CrossValidator()
  .setEstimator(pipeline)          // Cross-validate the whole pipeline so the tf and lr grid params apply
  .setEstimatorParamMaps(paramMaps)
  .setEvaluator(eval)
val bestModel = cv.fit(documents)
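Once fitted, a pipeline model can also be applied to new, unlabeled data: its transform() method runs every stage in sequence and appends a prediction column to the output SchemaRDD. The short sketch below is not part of the original listing; it assumes a hypothetical Document case class and a testDocuments dataset of unlabeled emails, and reuses the sqlContext._ import from above.

// Hypothetical class for unlabeled documents we want to classify
case class Document(id: Long, text: String)

val testDocuments = // (load RDD of Document)

// Run every pipeline stage on the test documents; the result is a SchemaRDD
// with the original columns plus "words", "features", and "prediction"
val predictions = model.transform(testDocuments)
predictions.select('id, 'text, 'prediction)
  .collect()
  .foreach(println)

The same call works on bestModel, the cross-validated model chosen by CrossValidator.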
The pipeline API is still experimental at the time of writing, but you can always find the most recent documentation for it in the MLlib documentation.
Conclusion
This chapter has given an overview of Spark's machine learning library. As you can
see, the library ties directly to Spark's other APIs, letting you work on RDDs and get
back results you can use in other Spark functions. MLlib is one of the most actively
developed parts of Spark, so it is still evolving. We recommend checking the official documentation for your Spark version to see the latest functions available in it.