Least squares regression
You might recall from Chapter 5, Building a Classification Model with Spark, that there are a variety of loss functions that can be applied to generalized linear models. The loss function used for least squares is the squared loss, which is defined as follows:
$$\frac{1}{2}\left(\mathbf{w}^T \mathbf{x} - y\right)^2$$
Here, as in the classification setting, $y$ is the target variable (this time, real-valued), $\mathbf{w}$ is the weight vector, and $\mathbf{x}$ is the feature vector.
The related link function is the identity link, and the decision function is also the identity function, as, generally, no thresholding is applied in regression. So, the model's prediction is simply $y = \mathbf{w}^T \mathbf{x}$.
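To make these two formulas concrete, here is a minimal, self-contained Scala sketch that computes the identity-link prediction and the squared loss for a single example. This is plain Scala, not MLlib code; the object and function names are illustrative:

```scala
object SquaredLossExample {
  // Dot product: w^T x
  def dot(w: Array[Double], x: Array[Double]): Double =
    w.zip(x).map { case (wi, xi) => wi * xi }.sum

  // Identity-link prediction: y_hat = w^T x (no thresholding)
  def predict(w: Array[Double], x: Array[Double]): Double = dot(w, x)

  // Squared loss for a single example: 1/2 * (w^T x - y)^2
  def squaredLoss(w: Array[Double], x: Array[Double], y: Double): Double = {
    val error = predict(w, x) - y
    0.5 * error * error
  }

  def main(args: Array[String]): Unit = {
    val w = Array(0.5, -1.0, 2.0)
    val x = Array(1.0, 2.0, 0.5)
    val y = 1.0
    println(s"prediction = ${predict(w, x)}")  // 0.5 - 2.0 + 1.0 = -0.5
    println(s"loss = ${squaredLoss(w, x, y)}") // 0.5 * (-1.5)^2 = 1.125
  }
}
```

Note how doubling the error from -1.5 to -3.0 would quadruple the loss to 4.5, which is exactly the magnification effect discussed next.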
The standard least squares regression in MLlib does not use regularization. Looking at the squared loss function, we can see that the loss applied to incorrectly predicted points will be magnified, since the loss is squared. This means that least squares regression is susceptible to outliers in the dataset, and also to over-fitting. Generally, as for classification, we should apply some level of regularization in practice.
Linear regression with L2 regularization is commonly referred to as ridge regression, while applying L1 regularization is called the lasso.
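As a rough illustration, the following sketch trains all three variants using MLlib's RDD-based API. It assumes it is run in the Spark shell, where sc is predefined; the toy data, step size (1.0), and regularization parameter (0.1) are placeholder values for illustration, not recommendations:

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LassoWithSGD,
  LinearRegressionWithSGD, RidgeRegressionWithSGD}
import org.apache.spark.rdd.RDD

// Toy dataset: labels paired with two-dimensional feature vectors
val data: RDD[LabeledPoint] = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.5, 1.2)),
  LabeledPoint(2.5, Vectors.dense(1.0, 0.3)),
  LabeledPoint(0.7, Vectors.dense(0.1, 2.0))
)).cache()

val numIterations = 100

// Plain least squares: no regularization
val lrModel = LinearRegressionWithSGD.train(data, numIterations)

// Ridge regression: squared loss with L2 regularization
// (stepSize = 1.0, regParam = 0.1)
val ridgeModel = RidgeRegressionWithSGD.train(data, numIterations, 1.0, 0.1)

// Lasso: squared loss with L1 regularization
val lassoModel = LassoWithSGD.train(data, numIterations, 1.0, 0.1)

// Predictions are simply w^T x for each model
val prediction = ridgeModel.predict(Vectors.dense(0.5, 1.2))
```

All three models expose the same predict method, so switching between unregularized, L2, and L1 variants is only a matter of which train call is used.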
Tip
See the section on linear least squares in the Spark MLlib documentation at http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression for further information.