Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset
We can now apply the models from MLlib to our input data. First, we need to import the
required classes and set up some minimal input parameters for each model. For logistic regression and SVM, this is the number of iterations, while for the decision tree model, it is
the maximum tree depth:
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy
val numIterations = 10
val maxTreeDepth = 5
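The `data` variable passed to the train calls below is assumed to be the RDD of LabeledPoint instances prepared from the StumbleUpon records earlier in the chapter. As a minimal, hypothetical sketch of that assumption (the `records` RDD of parsed fields is a stand-in, and missing-value handling is glossed over):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Each record becomes a LabeledPoint: a 0/1 evergreen label plus a dense
// vector of the numeric feature columns. Real data would also need missing
// values (such as "?") cleaned before calling toDouble.
val data = records.map { r =>
  val label = r.last.toInt
  val features = r.slice(4, r.size - 1).map(_.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}
data.cache()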
Now, train each model in turn. First, we will train logistic regression:
val lrModel = LogisticRegressionWithSGD.train(data, numIterations)
...
14/12/06 13:41:47 INFO DAGScheduler: Job 81 finished: reduce at RDDFunctions.scala:112, took 0.011968 s
14/12/06 13:41:47 INFO GradientDescent: GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses 0.6931471805599474, 1196521.395699124, Infinity, 1861127.002201189, Infinity, 2639638.049627607, Infinity, Infinity, Infinity, Infinity
lrModel: org.apache.spark.mllib.classification.LogisticRegressionModel = (weights=[-0.11372778986947886,-0.511619752777837, ...
Next up, we will train an SVM model:
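As a minimal sketch (assuming the same `data` RDD, `numIterations`, and `maxTreeDepth` values defined above, together with the classes imported earlier), the remaining models can be trained with similarly compact calls:
// Train an SVM with the same number of SGD iterations as logistic regression.
val svmModel = SVMWithSGD.train(data, numIterations)
// Naive Bayes requires non-negative feature values, so this assumes any
// negative entries have been replaced (for example, set to zero) beforehand.
val nbModel = NaiveBayes.train(data)
// Train a classification decision tree using entropy impurity and the
// configured maximum depth.
val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth)
Each train call returns a fitted model whose predict method maps a feature vector to a predicted class.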