val points: RDD[LabeledPoint] = // ...
val lr = new LinearRegressionWithSGD().setNumIterations(200).setIntercept(true)
val model = lr.run(points)
println("weights: %s, intercept: %s".format(model.weights, model.intercept))
Example 11-12. Linear regression in Java
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.regression.LinearRegressionWithSGD;
import org.apache.spark.mllib.regression.LinearRegressionModel;

JavaRDD<LabeledPoint> points = // ...
LinearRegressionWithSGD lr = new LinearRegressionWithSGD().setNumIterations(200).setIntercept(true);
LinearRegressionModel model = lr.run(points.rdd());
System.out.printf("weights: %s, intercept: %s\n", model.weights(), model.intercept());
Note that in Java, we need to convert our JavaRDD to the Scala RDD class by calling .rdd() on it. This is a common pattern throughout MLlib because the MLlib methods are designed to be callable from both Java and Scala.
Once trained, the LinearRegressionModel returned in all languages includes a predict() function that can be used to predict a value on a single vector. The RidgeRegressionWithSGD and LassoWithSGD classes behave similarly and return a similar model class. Indeed, this pattern of an algorithm with parameters adjusted through setters, which returns a Model object with a predict() method, is common in all of MLlib.
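As a brief sketch of the predict() call just described (assuming the trained model from the Scala example above is in scope, and with illustrative feature values):

```scala
import org.apache.spark.mllib.linalg.Vectors

// Build an MLlib Vector of features; the values here are made up for illustration.
val features = Vectors.dense(1.0, 2.0, 3.0)

// predict() on a single vector returns a Double.
val prediction: Double = model.predict(features)
```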
Logistic regression
Logistic regression is a binary classification method that identifies a linear separating plane between positive and negative examples. In MLlib, it takes LabeledPoints with label 0 or 1 and returns a LogisticRegressionModel that can predict new points.

The logistic regression algorithm has a very similar API to linear regression, covered in the previous section. One difference is that there are two algorithms available for solving it: SGD and LBFGS.4 LBFGS is generally the best choice, but is not available in some earlier versions of MLlib (before Spark 1.2). These algorithms are available in the mllib.classification.LogisticRegressionWithLBFGS and WithSGD classes, which have interfaces similar to LinearRegressionWithSGD. They take all the same parameters as linear regression (see the previous section).
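A minimal sketch of training with the LBFGS variant, mirroring the linear regression examples above (assuming a SparkContext sc is available; the two training points are hypothetical toy data):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.linalg.Vectors

// Hypothetical training data; labels must be 0 or 1 for logistic regression.
val training = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 1.1)),
  LabeledPoint(1.0, Vectors.dense(2.0, 1.0))))

// Same setter-based configuration pattern as LinearRegressionWithSGD.
val lr = new LogisticRegressionWithLBFGS().setIntercept(true)
val model = lr.run(training)

// predict() returns the predicted class (0.0 or 1.0) for a new point.
println(model.predict(Vectors.dense(1.5, 1.0)))
```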
4 LBFGS is an approximation to Newton's method that converges in fewer iterations than stochastic gradient descent. It is described at http://en.wikipedia.org/wiki/Limited-memory_BFGS.