Training a classification model on the Kaggle/StumbleUpon evergreen classification dataset
We can now apply the models from MLlib to our input data. First, we need to import the
required classes and set up some minimal input parameters for each model. For logistic regression and SVM, this is the number of iterations, while for the decision tree model, it is
the maximum tree depth:
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Algo
import org.apache.spark.mllib.tree.impurity.Entropy
val numIterations = 10
val maxTreeDepth = 5
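The `data` variable passed to the train calls below is assumed to be the RDD of LabeledPoint instances prepared from the StumbleUpon records earlier in the chapter. As a minimal, hypothetical sketch of that assumption (the `records` RDD of parsed fields is a stand-in, and missing-value handling is glossed over):
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Each record becomes a LabeledPoint: a 0/1 evergreen label plus a dense
// vector of the numeric feature columns. Real data would also need missing
// values (such as "?") cleaned before calling toDouble.
val data = records.map { r =>
  val label = r.last.toInt
  val features = r.slice(4, r.size - 1).map(_.toDouble)
  LabeledPoint(label, Vectors.dense(features))
}
data.cache()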
Now, train each model in turn. First, we will train logistic regression:
val lrModel = LogisticRegressionWithSGD.train(data, numIterations)
...
14/12/06 13:41:47 INFO DAGScheduler: Job 81 finished: reduce at RDDFunctions.scala:112, took 0.011968 s
14/12/06 13:41:47 INFO GradientDescent: GradientDescent.runMiniBatchSGD finished. Last 10 stochastic losses 0.6931471805599474, 1196521.395699124, Infinity, 1861127.002201189, Infinity, 2639638.049627607, Infinity, Infinity, Infinity, Infinity
lrModel: org.apache.spark.mllib.classification.LogisticRegressionModel = (weights=[-0.11372778986947886,-0.511619752777837, ...
Next up, we will train an SVM model:
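As a minimal sketch (assuming the same `data` RDD, `numIterations`, and `maxTreeDepth` values defined above, together with the classes imported earlier), the remaining models can be trained with similarly compact calls:
// Train an SVM with the same number of SGD iterations as logistic regression.
val svmModel = SVMWithSGD.train(data, numIterations)
// Naive Bayes requires non-negative feature values, so this assumes any
// negative entries have been replaced (for example, set to zero) beforehand.
val nbModel = NaiveBayes.train(data)
// Train a classification decision tree using entropy impurity and the
// configured maximum depth.
val dtModel = DecisionTree.train(data, Algo.Classification, Entropy, maxTreeDepth)
Each train call returns a fitted model whose predict method maps a feature vector to a predicted class.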