As mentioned earlier, the link function used in logistic regression is the logit link:
1 / (1 + exp(-w^T x))
The related loss function for logistic regression is the logistic loss:
log(1 + exp(-y w^T x))
Here, y is the actual target variable (either 1 for the positive class or -1 for the negative class).
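To make the two formulas concrete, here is a minimal pure-Python sketch (not taken from the book) that computes the logit link and the logistic loss for a single data point, with w and x as plain lists of floats:

```python
import math

def dot(w, x):
    # Linear predictor w^T x.
    return sum(wi * xi for wi, xi in zip(w, x))

def logistic_link(w, x):
    # Logit link: maps w^T x to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-dot(w, x)))

def logistic_loss(w, x, y):
    # Logistic loss for a label y in {+1, -1}:
    # log(1 + exp(-y * w^T x)).
    return math.log(1.0 + math.exp(-y * dot(w, x)))
```

Note that when w^T x = 0 the link returns exactly 0.5, and the loss equals log(2) regardless of the label; a confident correct prediction drives the loss toward zero, while a confident wrong one grows roughly linearly in the margin.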
Linear support vector machines
SVM is a powerful and popular technique for regression and classification. Unlike logistic
regression, it is not a probabilistic model but predicts classes based on whether the model
evaluation is positive or negative.
The SVM link function is the identity link, so the predicted outcome is:
y = w^T x
Hence, if the evaluation of w^T x is greater than or equal to a threshold of 0, the SVM will assign the data point to class 1; otherwise, it will assign it to class 0 (this threshold is a model parameter of SVM and can be adjusted).
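The decision rule above can be sketched in a few lines of Python (an illustrative helper, not from the book), with the threshold exposed as an adjustable parameter:

```python
def svm_predict(w, x, threshold=0.0):
    # Identity link: the raw model evaluation is just w^T x.
    score = sum(wi * xi for wi, xi in zip(w, x))
    # Class 1 if the score clears the threshold, class 0 otherwise.
    return 1 if score >= threshold else 0
```

Raising the threshold makes the classifier more conservative about predicting class 1, which is a common way to trade precision against recall.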
The loss function for SVM is known as the hinge loss and is defined as:

max(0, 1 - y w^T x)
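As a small illustration (again a sketch, not the book's code), the hinge loss is zero whenever the point is classified correctly with a margin of at least 1, and grows linearly otherwise:

```python
def hinge_loss(w, x, y):
    # Hinge loss for a label y in {+1, -1}:
    # max(0, 1 - y * w^T x).
    z = sum(wi * xi for wi, xi in zip(w, x))
    return max(0.0, 1.0 - y * z)
```

This flat region for well-classified points is what makes the SVM a maximum-margin method: points already beyond the margin contribute nothing to the loss, so the optimizer focuses on the points near or on the wrong side of the boundary.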
SVM is a maximum margin classifier—it tries to find a weight vector such that the classes
are separated as much as possible. It has been shown to perform well on many classification tasks, and the linear variant can scale to very large datasets.
Note
SVMs have a large amount of theory behind them, which is beyond the scope of this
topic, but you can visit
http://en.wikipedia.org/wiki/Support_vector_machine
and
http://www.support-vector-machines.org/
for more details.