val points: RDD[LabeledPoint] = // ...
val lr = new LinearRegressionWithSGD().setNumIterations(200).setIntercept(true)
val model = lr.run(points)
println("weights: %s, intercept: %s".format(model.weights, model.intercept))
Example 11-12. Linear regression in Java
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
import org.apache.spark.mllib.regression.LinearRegressionModel;

JavaRDD<LabeledPoint> points = // ...
LinearRegressionWithSGD lr = new LinearRegressionWithSGD().setNumIterations(200).setIntercept(true);
LinearRegressionModel model = lr.run(points.rdd());
System.out.printf("weights: %s, intercept: %s\n", model.weights(), model.intercept());
Note that in Java, we need to convert our JavaRDD to the Scala RDD class by calling .rdd() on it. This is a common pattern throughout MLlib because the MLlib methods are designed to be callable from both Java and Scala.
Once trained, the LinearRegressionModel returned in all languages includes a predict() function that can be used to predict a value on a single vector. The RidgeRegressionWithSGD and LassoWithSGD classes behave similarly and return a similar model class. Indeed, this pattern of an algorithm with parameters adjusted through setters, which returns a Model object with a predict() method, is common in all of MLlib.
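As a brief sketch of the predict() call just described (assuming the trained model from the Scala example above is in scope, and with illustrative feature values):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Build an MLlib Vector of features; the values here are made up for illustration.
val features = Vectors.dense(1.0, 2.0, 3.0)

// predict() on a single vector returns a Double.
val prediction: Double = model.predict(features)
```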
Logistic regression
Logistic regression is a binary classification method that identifies a linear separating plane between positive and negative examples. In MLlib, it takes LabeledPoints with label 0 or 1 and returns a LogisticRegressionModel that can predict new points.

The logistic regression algorithm has a very similar API to linear regression, covered in the previous section. One difference is that there are two algorithms available for solving it: SGD and LBFGS.4 LBFGS is generally the best choice, but is not available in some earlier versions of MLlib (before Spark 1.2). These algorithms are available in the mllib.classification.LogisticRegressionWithLBFGS and WithSGD classes, which have interfaces similar to LinearRegressionWithSGD. They take all the same parameters as linear regression (see the previous section).
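A minimal sketch of training with the LBFGS variant, mirroring the linear regression examples above (assuming a SparkContext sc is available; the two training points are hypothetical toy data):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical training data; labels must be 0 or 1 for logistic regression.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.1)),
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0))))

// Same setter-based configuration pattern as LinearRegressionWithSGD.
val lr = new LogisticRegressionWithLBFGS().setIntercept(true)
val model = lr.run(training)

// predict() returns the predicted class (0.0 or 1.0) for a new point.
println(model.predict(Vectors.dense(1.5, 1.0)))
```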
4 LBFGS is an approximation to Newton's method that converges in fewer iterations than stochastic gradient descent. It is described at http://en.wikipedia.org/wiki/Limited-memory_BFGS.