Linear regression
Linear regression is one of the most common methods for regression, predicting the
output variable as a linear combination of the features. MLlib also supports L1- and
L2-regularized regression, commonly known as Lasso and ridge regression.
The linear regression algorithms are available through the
mllib.regression.LinearRegressionWithSGD, LassoWithSGD, and
RidgeRegressionWithSGD classes. These follow a common naming pattern throughout
MLlib, where problems involving multiple algorithms have a "With" part in the class
name to specify the algorithm used. Here, SGD is Stochastic Gradient Descent.
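To make the SGD part concrete, here is a minimal sketch of stochastic gradient descent for a one-feature linear model in plain Python. The function name `sgd_linear_regression` and the toy data are illustrative only; MLlib's distributed implementation differs in many details (step-size decay, mini-batching, parallel aggregation), but the per-point update is the same idea.

```python
import random

def sgd_linear_regression(points, num_iterations=100, step_size=0.01):
    """Toy SGD for y = w * x (single feature, no intercept, no regularization).

    `points` is a list of (x, y) pairs. This is a sketch of the idea,
    not MLlib's actual implementation.
    """
    w = 0.0
    rng = random.Random(42)  # fixed seed so the sketch is deterministic
    for _ in range(num_iterations):
        x, y = rng.choice(points)      # pick one training example at random
        gradient = (w * x - y) * x     # gradient of 0.5 * (w*x - y)^2 w.r.t. w
        w -= step_size * gradient      # step against the gradient
    return w

# Points lying exactly on y = 2x, so w should converge near 2.0.
data = [(x, 2.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]
w = sgd_linear_regression(data, num_iterations=500, step_size=0.05)
```

Each iteration looks at a single randomly chosen point rather than the whole dataset, which is what makes the method "stochastic" and cheap per step.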
These classes all have several parameters to tune the algorithm:
numIterations
Number of iterations to run (default: 100).
stepSize
Step size for gradient descent (default: 1.0).
intercept
Whether to add an intercept or bias feature to the data—that is, another feature
whose value is always 1 (default: false).
regParam
Regularization parameter for Lasso and ridge (default: 1.0).
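To see what intercept and regParam do mechanically, here is a hedged plain-Python sketch of a single ridge-style gradient step (the function `ridge_gradient_step` is hypothetical, not MLlib code): the intercept is equivalent to appending a constant feature whose value is always 1.0, and the regularization parameter adds a term to the gradient that shrinks the weights toward zero.

```python
def ridge_gradient_step(weights, features, label, step_size=1.0, reg_param=1.0):
    """One SGD step for squared loss with an L2 penalty (ridge-style).

    Sketch only: MLlib's internal updater differs in details such as
    step-size decay, but the regularization term has the same effect.
    """
    prediction = sum(w * x for w, x in zip(weights, features))
    error = prediction - label
    # Gradient in coordinate j is error * x_j plus the L2 term reg_param * w_j.
    return [w - step_size * (error * x + reg_param * w)
            for w, x in zip(weights, features)]

# intercept=True amounts to appending a constant 1.0 feature:
features = [3.0, 1.0]          # one real feature plus the bias feature
weights = [0.5, 0.5]
updated = ridge_gradient_step(weights, features, label=1.0,
                              step_size=0.1, reg_param=0.0)
shrunk = ridge_gradient_step(weights, features, label=1.0,
                             step_size=0.1, reg_param=1.0)
```

With reg_param=0.0 this is plain least-squares SGD; with reg_param=1.0 every weight is pulled toward zero on each step, which is why larger regParam values produce smaller weights.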
The way to call the algorithms differs slightly by language. In Java and Scala, you
create a LinearRegressionWithSGD object, call setter methods on it to set the
parameters, and then call run() to train a model. In Python, you instead use the
class method LinearRegressionWithSGD.train(), to which you pass key/value
parameters. In both cases, you pass in an RDD of LabeledPoints, as shown in
Examples 11-10 through 11-12.
Example 11-10. Linear regression in Python
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD

points = # (create RDD of LabeledPoint)
model = LinearRegressionWithSGD.train(points, iterations=200, intercept=True)
print "weights: %s, intercept: %s" % (model.weights, model.intercept)
Example 11-11. Linear regression in Scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val points = // (create RDD of LabeledPoint)
val lr = new LinearRegressionWithSGD().setIntercept(true)
lr.optimizer.setNumIterations(200)
val model = lr.run(points)
println("weights: %s, intercept: %s".format(model.weights, model.intercept))