Finally, we point out a computational difficulty of S3VMs. The S3VM objective function (6.16) is non-convex. A function g is convex if, for any z_1, z_2 and 0 ≤ λ ≤ 1,

    g(λ z_1 + (1 − λ) z_2) ≤ λ g(z_1) + (1 − λ) g(z_2).    (6.21)

For example, the SVM objective (6.12) is a convex function of the parameters w, b. This can be verified by the convexity of the hinge loss, the squared norm, and the fact that the sum of convex functions is convex. Minimizing a convex function is relatively easy, as such a function has a well-defined “bottom.” On the other hand, the hat loss function is non-convex, as demonstrated by z_1 = −1, z_2 = 1, and λ = 0.5. With the sum of a large number of hat functions, the S3VM objective (6.16) is non-convex with multiple local minima. A learning algorithm can get trapped in a sub-optimal local minimum and fail to find the global minimum solution. Research on S3VMs has focused on how to efficiently find a near-optimal solution; some of this work is listed in the bibliographical notes.
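As a quick numeric illustration of this non-convexity (a minimal sketch, assuming the hat loss has the form hat(z) = max(1 − |z|, 0) introduced earlier in the chapter; the function name is ours), the following Python snippet evaluates both sides of (6.21) at z_1 = −1, z_2 = 1, λ = 0.5 and shows that the inequality fails:

```python
# Minimal check that the hat loss violates the convexity condition (6.21).
# Assumes hat(z) = max(1 - |z|, 0); the function name is illustrative.

def hat_loss(z):
    return max(1.0 - abs(z), 0.0)

z1, z2, lam = -1.0, 1.0, 0.5
lhs = hat_loss(lam * z1 + (1.0 - lam) * z2)            # hat(0) = 1
rhs = lam * hat_loss(z1) + (1.0 - lam) * hat_loss(z2)  # 0.5*0 + 0.5*0 = 0
print(lhs, rhs, lhs <= rhs)   # prints: 1.0 0.0 False -> (6.21) is violated
```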
6.3 ENTROPY REGULARIZATION
SVMs and S3VMs are non-probabilistic models. That is, they are not designed to compute the label posterior probability p(y | x) when making a classification. In statistical machine learning, there are many probabilistic models which compute p(y | x) from labeled training data for classification. Interestingly, there is a direct analogue of S3VMs for these probabilistic models too, known as entropy regularization. To make our discussion concrete, we will first introduce a particular probabilistic model, logistic regression, and then extend it to semi-supervised learning via entropy regularization.
Logistic regression models the posterior probability p(y | x). Like SVMs, it uses a linear decision function f(x) = w⊤x + b. Let the label y ∈ {−1, 1}. Recall that if f(x) ≫ 0, x is deep within the positive side of the decision boundary; if f(x) ≪ 0, x is deep within the negative side; and f(x) = 0 means x is right on the decision boundary, with maximum label uncertainty. Logistic regression models the posterior probability by

    p(y | x) = 1 / (1 + exp(−y f(x))),    (6.22)
which “squashes” f(x) ∈ (−∞, ∞) down to p(y | x) ∈ [0, 1]. The model parameters are w and b, like in SVMs. Given a labeled training sample {(x_i, y_i)}_{i=1}^{l}, the conditional log likelihood is defined as

    ∑_{i=1}^{l} log p(y_i | x_i, w, b).    (6.23)
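As a concrete sketch of (6.22) and (6.23), the short Python example below computes the logistic posterior and the conditional log likelihood on a tiny hypothetical labeled sample (the data, parameter values, and function names are illustrative assumptions, not from the text):

```python
import numpy as np

def posterior(x, y, w, b):
    # p(y | x) = 1 / (1 + exp(-y f(x))) with f(x) = w'x + b, as in (6.22)
    return 1.0 / (1.0 + np.exp(-y * (np.dot(w, x) + b)))

def conditional_log_likelihood(X, Y, w, b):
    # sum_{i=1}^{l} log p(y_i | x_i, w, b), as in (6.23)
    return sum(np.log(posterior(x, y, w, b)) for x, y in zip(X, Y))

# Hypothetical labeled sample with labels in {-1, +1} and toy parameters.
X = [np.array([1.0, 2.0]), np.array([-1.5, 0.5])]
Y = [+1, -1]
w, b = np.array([0.3, -0.2]), 0.1
print(conditional_log_likelihood(X, Y, w, b))
```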
If we further introduce a Gaussian distribution as the prior on w:

    w ∼ N(0, I/(2λ)),    (6.24)
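Since the log density of the prior (6.24) is −λ‖w‖² up to an additive constant, maximizing the resulting posterior over w and b amounts to maximizing the conditional log likelihood (6.23) minus λ‖w‖². The self-contained Python sketch below spells out that penalized objective (the vectorized helper, objective name, and toy values are illustrative assumptions):

```python
import numpy as np

def log_likelihood(X, Y, w, b):
    # Conditional log likelihood (6.23), with p(y|x) from (6.22), vectorized.
    f = X @ w + b
    return np.sum(np.log(1.0 / (1.0 + np.exp(-Y * f))))

def map_objective(X, Y, w, b, lam):
    # Log likelihood plus the log prior (6.24), dropping its additive constant:
    # log N(w; 0, I/(2*lam)) = -lam * ||w||^2 + const.
    return log_likelihood(X, Y, w, b) - lam * np.dot(w, w)

# Toy data and parameters (illustrative).
X = np.array([[1.0, 2.0], [-1.5, 0.5]])
Y = np.array([1.0, -1.0])
w, b, lam = np.array([0.3, -0.2]), 0.1, 1.0
print(map_objective(X, Y, w, b, lam))
```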