Least squares regression
You might recall from Chapter 5, Building a Classification Model with Spark, that there are a variety of loss functions that can be applied to generalized linear models. The loss function used for least squares is the squared loss, which is defined as follows:
$$\frac{1}{2}\left(\mathbf{w}^T \mathbf{x} - y\right)^2$$
Here, as in the classification setting, $y$ is the target variable (this time, real-valued), $\mathbf{w}$ is the weight vector, and $\mathbf{x}$ is the feature vector.
The related link function is the identity link, and the decision function is also the identity function, as, generally, no thresholding is applied in regression. So, the model's prediction is simply $y = \mathbf{w}^T \mathbf{x}$.
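To make these two formulas concrete, here is a minimal, self-contained Scala sketch that computes the identity-link prediction and the squared loss for a single example. This is plain Scala, not MLlib code; the object and function names are illustrative:

```scala
object SquaredLossExample {
  // Dot product: w^T x
  def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum

  // Identity-link prediction: y_hat = w^T x (no thresholding)
  def predict(w: Array[Double], x: Array[Double]): Double = dot(w, x)

  // Squared loss for a single example: 1/2 * (w^T x - y)^2
  def squaredLoss(w: Array[Double], x: Array[Double], y: Double): Double = {
    val error = predict(w, x) - y
    0.5 * error * error
  }

  def main(args: Array[String]): Unit = {
    val w = Array(0.5, -1.0, 2.0)
    val x = Array(1.0, 2.0, 0.5)
    val y = 1.0
    println(s"prediction = ${predict(w, x)}")  // 0.5 - 2.0 + 1.0 = -0.5
    println(s"loss = ${squaredLoss(w, x, y)}") // 0.5 * (-1.5)^2 = 1.125
  }
}
```

Note how doubling the error from -1.5 to -3.0 would quadruple the loss to 4.5, which is exactly the magnification effect discussed next.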
The standard least squares regression in MLlib does not use regularization. Looking at the squared loss function, we can see that the loss applied to incorrectly predicted points will be magnified, since the loss is squared. This means that least squares regression is susceptible to outliers in the dataset, and also to over-fitting. Generally, as for classification, we should apply some level of regularization in practice.
Linear regression with L2 regularization is commonly referred to as ridge regression, while applying L1 regularization is called the lasso.
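As a rough illustration, the following sketch trains all three variants using MLlib's RDD-based API. It assumes it is run in the Spark shell, where sc is predefined; the toy data, step size (1.0), and regularization parameter (0.1) are placeholder values for illustration, not recommendations:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LassoWithSGD,
  LinearRegressionWithSGD, RidgeRegressionWithSGD}
import org.apache.spark.rdd.RDD

// Toy dataset: labels paired with two-dimensional feature vectors
val data: RDD[LabeledPoint] = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.5, 1.2)),
  LabeledPoint(2.5, Vectors.dense(1.0, 0.3)),
  LabeledPoint(0.7, Vectors.dense(0.1, 2.0))
)).cache()

val numIterations = 100

// Plain least squares: no regularization
val lrModel = LinearRegressionWithSGD.train(data, numIterations)

// Ridge regression: squared loss with L2 regularization
// (stepSize = 1.0, regParam = 0.1)
val ridgeModel = RidgeRegressionWithSGD.train(data, numIterations, 1.0, 0.1)

// Lasso: squared loss with L1 regularization
val lassoModel = LassoWithSGD.train(data, numIterations, 1.0, 0.1)

// Predictions are simply w^T x for each model
val prediction = ridgeModel.predict(Vectors.dense(0.5, 1.2))
```

All three models expose the same predict method, so switching between unregularized, L2, and L1 variants is only a matter of which train call is used.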
Tip
See the section on linear least squares in the Spark MLlib documentation at http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression for further information.