Finally, we point out a computational difficulty of S3VMs. The S3VM objective function (6.16) is non-convex. A function $g$ is convex if, for all $z_1, z_2$ and all $0 \le \lambda \le 1$,

$g(\lambda z_1 + (1 - \lambda) z_2) \le \lambda g(z_1) + (1 - \lambda) g(z_2).$   (6.21)
For example, the SVM objective (6.12) is a convex function of the parameters $\mathbf{w}, b$. This can be
verified by the convexity of the hinge loss, the squared norm, and the fact that the sum of convex
functions is convex. Minimizing a convex function is relatively easy, as such a function has a well-
defined “bottom.” On the other hand, the hat loss function is non-convex, as demonstrated by $z_1 = -0.5$, $z_2 = 1$, and $\lambda = 0.5$. With the sum of a large number of hat functions, the S3VM objective (6.16) is non-convex with multiple local minima. A learning algorithm can get trapped in a sub-optimal local minimum, and not find the global minimum solution. The research in S3VMs has focused on how to efficiently find a near-optimal solution; some of this work is listed in the bibliographical notes.
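As an illustrative aside (not part of the original text), the convexity test (6.21) can be checked numerically at the point above. The sketch below writes $z$ for the argument of each loss and assumes the hinge loss $\max(1 - z, 0)$ and the hat loss $\max(1 - |z|, 0)$, the forms presumed to be used in (6.12) and (6.16); the hinge loss satisfies the inequality at $z_1 = -0.5$, $z_2 = 1$, $\lambda = 0.5$, while the hat loss violates it.

```python
# Numerical check of the convexity inequality (6.21) at a single point.
# Assumed loss forms: hinge(z) = max(1 - z, 0), hat(z) = max(1 - |z|, 0).

def convexity_holds(g, z1, z2, lam):
    """True if g(lam*z1 + (1-lam)*z2) <= lam*g(z1) + (1-lam)*g(z2)."""
    return g(lam * z1 + (1 - lam) * z2) <= lam * g(z1) + (1 - lam) * g(z2)

hinge = lambda z: max(1 - z, 0.0)
hat   = lambda z: max(1 - abs(z), 0.0)

z1, z2, lam = -0.5, 1.0, 0.5
print(convexity_holds(hinge, z1, z2, lam))  # True: 0.75 <= 0.5*1.5 + 0.5*0
print(convexity_holds(hat,   z1, z2, lam))  # False: hat(0.25) = 0.75 > 0.5*0.5 + 0.5*0 = 0.25
```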
6.3 ENTROPY REGULARIZATION∗
SVMs and S3VMs are non-probabilistic models. That is, they are not designed to compute the
label posterior probability $p(y \mid \mathbf{x})$ when making a classification. In statistical machine learning, there are many probabilistic models which compute $p(y \mid \mathbf{x})$ from labeled training data for classification.
Interestingly, there is a direct analogue of S3VM for these probabilistic models too, known as entropy
regularization. To make our discussion concrete, we will first introduce a particular probabilistic
model: logistic regression, and then extend it to semi-supervised learning via entropy regularization.
Logistic regression models the posterior probability $p(y \mid \mathbf{x})$. Like SVMs, it uses a linear decision function $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$. Let the label $y \in \{-1, 1\}$. Recall that if $f(\mathbf{x}) \gg 0$, $\mathbf{x}$ is deep within the positive side of the decision boundary; if $f(\mathbf{x}) \ll 0$, $\mathbf{x}$ is deep within the negative side; and $f(\mathbf{x}) = 0$ means $\mathbf{x}$ is right on the decision boundary with maximum label uncertainty. Logistic regression models the posterior probability by

$p(y \mid \mathbf{x}) = 1 / \left(1 + \exp(-y f(\mathbf{x}))\right),$   (6.22)
which “squashes” $f(\mathbf{x}) \in (-\infty, \infty)$ down to $p(y \mid \mathbf{x}) \in [0, 1]$. The model parameters are $\mathbf{w}$ and $b$, like in SVMs. Given a labeled training sample $\{(\mathbf{x}_i, y_i)\}_{i=1}^{l}$, the conditional log likelihood is defined as

$\sum_{i=1}^{l} \log p(y_i \mid \mathbf{x}_i, \mathbf{w}, b).$   (6.23)
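As a hedged illustration (not from the text), the sketch below evaluates the posterior (6.22) and the conditional log likelihood (6.23) for a toy labeled sample; the weight vector, bias, and data points are made-up placeholders.

```python
import numpy as np

# Posterior (6.22): p(y | x) = 1 / (1 + exp(-y * f(x))), with f(x) = w.x + b.
def posterior(y, x, w, b):
    return 1.0 / (1.0 + np.exp(-y * (np.dot(w, x) + b)))

# Conditional log likelihood (6.23): sum_i log p(y_i | x_i, w, b).
def conditional_log_likelihood(X, y, w, b):
    return sum(np.log(posterior(yi, xi, w, b)) for xi, yi in zip(X, y))

# Made-up parameters and two labeled points, for illustration only.
w, b = np.array([1.0, -2.0]), 0.5
X = [np.array([0.3, 0.1]), np.array([-1.0, 0.4])]
y = [+1, -1]

# The posteriors for y = +1 and y = -1 always sum to 1.
print(posterior(+1, X[0], w, b) + posterior(-1, X[0], w, b))  # 1.0
print(conditional_log_likelihood(X, y, w, b))
```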
If we further introduce a Gaussian distribution as the prior on $\mathbf{w}$:

$\mathbf{w} \sim \mathcal{N}(\mathbf{0}, I/(2\lambda))$   (6.24)
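As a brief illustrative aside (not from the text), the log density of the prior (6.24) works out to $-\lambda \|\mathbf{w}\|^2$ plus a constant, i.e., an L2 penalty on $\mathbf{w}$, which is presumably why the covariance is parameterized as $I/(2\lambda)$. The sketch below checks this numerically with made-up values of $\lambda$ and $\mathbf{w}$.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Prior (6.24): w ~ N(0, I / (2*lambda)).  Its log density is
# -lambda * ||w||^2 + (d/2) * log(lambda / pi): an L2 penalty plus a constant.
lam, d = 0.7, 3                        # made-up values for illustration
w = np.array([0.2, -1.0, 0.5])         # made-up weight vector

log_prior = multivariate_normal.logpdf(w, mean=np.zeros(d), cov=np.eye(d) / (2 * lam))
closed_form = -lam * np.dot(w, w) + 0.5 * d * np.log(lam / np.pi)
print(np.isclose(log_prior, closed_form))  # True
```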