Linear regression

Linear regression is one of the most common methods for regression, predicting the output variable as a linear combination of the features. MLlib also supports L1 and L2 regularized regression, commonly known as Lasso and ridge regression.
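For reference, the objectives these three methods minimize can be written as follows (this formulation is standard but not spelled out in the text; λ plays the role of the regParam parameter described below):

```latex
\min_{w} \sum_i \bigl(w^\top x_i - y_i\bigr)^2
  \quad \text{(ordinary least squares)}

\min_{w} \sum_i \bigl(w^\top x_i - y_i\bigr)^2 + \lambda \lVert w \rVert_1
  \quad \text{(Lasso, $L_1$)}

\min_{w} \sum_i \bigl(w^\top x_i - y_i\bigr)^2 + \lambda \lVert w \rVert_2^2
  \quad \text{(ridge, $L_2$)}
```

Here w is the weight vector, and each training example has feature vector x_i and label y_i.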
The linear regression algorithms are available through the mllib.regression.LinearRegressionWithSGD, LassoWithSGD, and RidgeRegressionWithSGD classes. These follow a common naming pattern throughout MLlib, where problems involving multiple algorithms have a "With" part in the class name to specify the algorithm used. Here, SGD is Stochastic Gradient Descent.
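MLlib's trainers hide the optimization loop, so as a rough illustration of what Stochastic Gradient Descent does for least-squares linear regression, here is a minimal, self-contained sketch in plain Python (this is not MLlib code; the function name and toy data are made up for the example):

```python
# Illustrative sketch of SGD for least-squares linear regression (no MLlib).
import random

def sgd_linear_regression(points, num_iterations=100, step_size=0.01):
    """points: list of (features, label) pairs; returns learned weights."""
    dim = len(points[0][0])
    weights = [0.0] * dim
    rng = random.Random(42)  # fixed seed so runs are reproducible
    for _ in range(num_iterations):
        x, y = rng.choice(points)  # one randomly chosen example per update
        pred = sum(w * xi for w, xi in zip(weights, x))
        err = pred - y  # derivative of (1/2)*(pred - y)^2 with respect to pred
        # Step each weight against the gradient of the squared error.
        weights = [w - step_size * err * xi for w, xi in zip(weights, x)]
    return weights

# Toy one-feature data generated from y = 2*x; the learned weight
# approaches 2.0 as the iterations proceed.
data = [([x], 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0, 2.5]]
w = sgd_linear_regression(data, num_iterations=2000, step_size=0.05)
```

MLlib performs conceptually similar updates, but distributes the gradient computation over the RDD's partitions.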
These classes all have several parameters to tune the algorithm:

numIterations
    Number of iterations to run (default: 100).
stepSize
    Step size for gradient descent (default: 1.0).
intercept
    Whether to add an intercept or bias feature to the data, that is, another feature whose value is always 1 (default: false).
regParam
    Regularization parameter for Lasso and ridge (default: 1.0).
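To make regParam's role concrete, the sketch below shows how the regularization term changes a single SGD update for ridge (L2) and Lasso (L1). This is plain illustrative Python, not MLlib's implementation; the function names are invented for the example:

```python
# Illustration of how reg_param enters one SGD update on one example (x, y).

def _sign(v):
    """Sign of v: -1, 0, or 1 (subgradient of |v|)."""
    return (v > 0) - (v < 0)

def ridge_sgd_step(weights, x, y, step_size, reg_param):
    """One SGD step for squared loss plus (reg_param) * ||w||_2^2 / 2."""
    err = sum(w * xi for w, xi in zip(weights, x)) - y
    # L2 regularization adds reg_param * w to the gradient,
    # shrinking every weight toward zero proportionally.
    return [w - step_size * (err * xi + reg_param * w)
            for w, xi in zip(weights, x)]

def lasso_sgd_step(weights, x, y, step_size, reg_param):
    """One SGD step for squared loss plus reg_param * ||w||_1."""
    err = sum(w * xi for w, xi in zip(weights, x)) - y
    # L1 regularization adds reg_param * sign(w), a constant-size push
    # toward zero, which tends to drive small weights exactly to zero.
    return [w - step_size * (err * xi + reg_param * _sign(w))
            for w, xi in zip(weights, x)]
```

With reg_param set to 0, both reduce to the plain least-squares update used by LinearRegressionWithSGD.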
The way to call the algorithms differs slightly by language. In Java and Scala, you create a LinearRegressionWithSGD object, call setter methods on it to set the parameters, and then call run() to train a model. In Python, you instead use the class method LinearRegressionWithSGD.train(), to which you pass key/value parameters. In both cases, you pass in an RDD of LabeledPoints, as shown in Examples 11-10 through 11-12.
Example 11-10. Linear regression in Python

from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.regression import LinearRegressionWithSGD

points = # (create RDD of LabeledPoint)
model = LinearRegressionWithSGD.train(points, iterations=200, intercept=True)
print("weights: %s, intercept: %s" % (model.weights, model.intercept))
Example 11-11. Linear regression in Scala

import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.regression.LinearRegressionWithSGD