import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SQLContext

// A class to represent documents -- will be turned into a SchemaRDD
case class LabeledDocument(id: Long, text: String, label: Double)

val documents = // (load RDD of LabeledDocument)
val sqlContext = new SQLContext(sc)
import sqlContext._   // Implicitly converts RDDs of case classes to SchemaRDDs

// Configure an ML pipeline with three stages: tokenizer, tf, and lr; each stage
// outputs a column in a SchemaRDD and feeds it to the next stage's input column
val tokenizer = new Tokenizer()   // Splits each email into words
  .setInputCol("text")
  .setOutputCol("words")
val tf = new HashingTF()          // Maps email words to vectors of 10000 features
  .setNumFeatures(10000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression() // Uses "features" as inputCol by default
val pipeline = new Pipeline().setStages(Array(tokenizer, tf, lr))

// Fit the pipeline to the training documents
val model = pipeline.fit(documents)

// Alternatively, instead of fitting once with the parameters above, we can do a
// grid search over some parameters and pick the best model via cross-validation
val paramMaps = new ParamGridBuilder()
  .addGrid(tf.numFeatures, Array(10000, 20000))
  .addGrid(lr.maxIter, Array(100, 200))
  .build()                        // Builds all combinations of parameters
val eval = new BinaryClassificationEvaluator()
val cv = new CrossValidator()
  .setEstimator(pipeline)         // Cross-validate the whole pipeline, since the
                                  // grid tunes tf.numFeatures as well as lr.maxIter
  .setEstimatorParamMaps(paramMaps)
  .setEvaluator(eval)
val bestModel = cv.fit(documents)
The pipeline API is still experimental at the time of writing, but you can always find
the most recent documentation for it in the MLlib documentation.
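To round out the example, here is a minimal sketch of applying the fitted model to new, unlabeled documents. It is not part of the original listing: the Document case class, the test variable, and the prediction column name are assumptions, and the exact SchemaRDD operations vary by Spark version.

// A hypothetical sketch: score unlabeled documents with the fitted pipeline.
// "Document", "test", and the "prediction" column are assumptions.
case class Document(id: Long, text: String)
val test = // (load RDD of Document)
// transform() runs the tokenizer, tf, and lr stages and appends output columns
val predictions = model.transform(test)
predictions.select('id, 'prediction).collect().foreach(println)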
Conclusion
This chapter has given an overview of Spark's machine learning library. As you can
see, the library ties directly to Spark's other APIs, letting you work on RDDs and get
back results you can use in other Spark functions. MLlib is one of the most actively
developed parts of Spark, so it is still evolving. We recommend checking the official
documentation for your Spark version to see the latest functions available in it.
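As one last illustration of that integration, the following is a hedged sketch using the RDD-based API (not code from this chapter) in which a trained model is used inside ordinary RDD operations; the training RDD of LabeledPoint is assumed to exist already.

// A minimal sketch of MLlib's RDD-based API feeding back into normal Spark
// operations; "training" (an RDD[LabeledPoint]) is an assumed input.
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val training: RDD[LabeledPoint] = // (load or construct labeled points)
val lrModel = LogisticRegressionWithSGD.train(training, 100)  // 100 iterations
// The returned model can be used inside ordinary RDD transformations
val correct = training.filter(p => lrModel.predict(p.features) == p.label).count()
println(s"Training accuracy: ${correct.toDouble / training.count()}")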