import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SQLContext

// A class to represent documents -- will be turned into a SchemaRDD
case class LabeledDocument(id: Long, text: String, label: Double)

val documents = // (load RDD of LabeledDocument)

val sqlContext = new SQLContext(sc)
import sqlContext._

// Configure an ML pipeline with three stages: tokenizer, tf, and lr; each stage
// outputs a column in a SchemaRDD and feeds it to the next stage's input column
val tokenizer = new Tokenizer()    // Splits each email into words
  .setInputCol("text")
  .setOutputCol("words")
val tf = new HashingTF()           // Maps email words to vectors of 10000 features
  .setNumFeatures(10000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression() // Uses "features" as inputCol by default
val pipeline = new Pipeline().setStages(Array(tokenizer, tf, lr))

// Fit the pipeline to the training documents
val model = pipeline.fit(documents)

// Alternatively, instead of fitting once with the parameters above, we can do a
// grid search over some parameters and pick the best model via cross-validation
val paramMaps = new ParamGridBuilder()
  .addGrid(tf.numFeatures, Array(10000, 20000))
  .addGrid(lr.maxIter, Array(100, 200))
  .build()                         // Builds all combinations of parameters
val eval = new BinaryClassificationEvaluator()
val cv = new CrossValidator()
  .setEstimator(pipeline)          // Cross-validate the whole pipeline so the tf and lr grid params apply
  .setEstimatorParamMaps(paramMaps)
  .setEvaluator(eval)
val bestModel = cv.fit(documents)
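Once fitted, a pipeline model can also be applied to new, unlabeled data: its transform() method runs every stage in sequence and appends a prediction column to the output SchemaRDD. The short sketch below is not part of the original listing; it assumes a hypothetical Document case class and a testDocuments dataset of unlabeled emails, and reuses the sqlContext._ import from above.

// Hypothetical class for unlabeled documents we want to classify
case class Document(id: Long, text: String)

val testDocuments = // (load RDD of Document)

// Run every pipeline stage on the test documents; the result is a SchemaRDD
// with the original columns plus "words", "features", and "prediction"
val predictions = model.transform(testDocuments)
predictions.select('id, 'text, 'prediction)
  .collect()
  .foreach(println)

The same call works on bestModel, the cross-validated model chosen by CrossValidator.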
The pipeline API is still experimental at the time of writing, but you can always find the most recent documentation for it in the MLlib documentation.
Conclusion
This chapter has given an overview of Spark's machine learning library. As you can
see, the library ties directly to Spark's other APIs, letting you work on RDDs and get
back results you can use in other Spark functions. MLlib is one of the most actively
developed parts of Spark, so it is still evolving. We recommend checking the official documentation for your Spark version to see the latest functions available in it.