When regularization is absent or low, most models tend to over-fit the training dataset; this is a key reason for using cross-validation techniques when fitting models (which we will cover shortly). Conversely, since applying regularization encourages simpler models, performance can suffer when regularization is too high, as the model under-fits the data.
The forms of regularization available in MLlib are as follows (a brief usage sketch follows this list):
• SimpleUpdater: This equates to no regularization and is the default for logistic regression
• SquaredL2Updater: This implements a regularizer based on the squared L2-norm of the weight vector; this is the default for SVM models
• L1Updater: This applies a regularizer based on the L1-norm of the weight vector; this can lead to sparse solutions in the weight vector (as less important weights are pulled towards zero)
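In MLlib's SGD-based models, the updater is configured on the model's optimizer. The following is a minimal sketch, not taken from this chapter's listings, of how the L1Updater could be swapped in on a logistic regression model; the iteration count and regularization parameter here are illustrative values:
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.optimization.L1Updater

// Configure the SGD optimizer to apply L1 regularization;
// 0.1 is an illustrative regularization parameter
val lrL1 = new LogisticRegressionWithSGD
lrL1.optimizer
  .setNumIterations(10)
  .setUpdater(new L1Updater)
  .setRegParam(0.1)
// Calling lrL1.run(trainingData) would then train a model whose
// less important weights are pulled towards zero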
Note
Regularization and its relation to optimization are broad and heavily researched areas. Some more information is available from the following links:
• General regularization overview: http://en.wikipedia.org/wiki/Regularization_(mathematics)
• L2 regularization: http://en.wikipedia.org/wiki/Tikhonov_regularization
• Over-fitting and under-fitting: http://en.wikipedia.org/wiki/Overfitting
• Detailed overview of over-fitting and L1 versus L2 regularization: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.92.9860&rep=rep1&type=pdf
Let's explore the impact of a range of regularization parameters using SquaredL2Updater:
val regResults = Seq(0.001, 0.01, 0.1, 1.0, 10.0).map { param =>
  // Train on the scaled dataset with L2 regularization,
  // varying only the regularization parameter
  val model = trainWithParams(scaledDataCats, param, numIterations, new SquaredL2Updater, 1.0)
  createMetrics(s"$param L2 regularization parameter", scaledDataCats, model)
}
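This snippet relies on the trainWithParams and createMetrics helpers defined earlier in the chapter. For reference, here is a minimal sketch of what such helpers might look like (the exact earlier definitions may differ):
import org.apache.spark.mllib.classification.{ClassificationModel, LogisticRegressionWithSGD}
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.mllib.optimization.Updater
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Train a logistic regression model with the given optimizer settings
def trainWithParams(input: RDD[LabeledPoint], regParam: Double, numIterations: Int, updater: Updater, stepSize: Double) = {
  val lr = new LogisticRegressionWithSGD
  lr.optimizer
    .setNumIterations(numIterations)
    .setUpdater(updater)
    .setRegParam(regParam)
    .setStepSize(stepSize)
  lr.run(input)
}

// Compute the AUC for a trained model on the given dataset
def createMetrics(label: String, data: RDD[LabeledPoint], model: ClassificationModel) = {
  val scoreAndLabels = data.map { point =>
    (model.predict(point.features), point.label)
  }
  val metrics = new BinaryClassificationMetrics(scoreAndLabels)
  (label, metrics.areaUnderROC)
}
The resulting (label, AUC) pairs can then be printed to compare the settings, for example:
regResults.foreach { case (param, auc) => println(f"$param, AUC = ${auc * 100}%2.2f%%") }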