where $I$ is the identity matrix of the appropriate dimension, then logistic regression training is to maximize the posterior of the parameters:
\[
\begin{aligned}
\max_{\mathbf{w},b}\;& \log p\big(\mathbf{w},b \,\big|\, \{(\mathbf{x}_i, y_i)\}_{i=1}^{l}\big)\\
= \max_{\mathbf{w},b}\;& \log p\big(\{(\mathbf{x}_i, y_i)\}_{i=1}^{l} \,\big|\, \mathbf{w},b\big) + \log p(\mathbf{w})\\
= \max_{\mathbf{w},b}\;& \sum_{i=1}^{l} \log\Big(1 \big/ \big(1+\exp(-y_i f(\mathbf{x}_i))\big)\Big) - \lambda \|\mathbf{w}\|^2 \qquad (6.25)
\end{aligned}
\]
The second line follows from Bayes' rule, ignoring the denominator, which is constant with respect to the parameters. This is equivalent to the following regularized risk minimization problem:
\[
\min_{\mathbf{w},b}\; \sum_{i=1}^{l} \log\big(1+\exp(-y_i f(\mathbf{x}_i))\big) + \lambda \|\mathbf{w}\|^2, \qquad (6.26)
\]
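To make the optimization concrete, here is a minimal sketch (not from the book) that minimizes the objective in Eq. (6.26) by batch gradient descent, assuming a linear decision function $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b$; the function name, step size `eta`, iteration count, and toy data are illustrative choices, not anything prescribed by the text:

```python
import numpy as np

def train_logistic(X, y, lam=0.1, eta=0.01, iters=1000):
    """Minimize Eq. (6.26): sum_i log(1 + exp(-y_i f(x_i))) + lam * ||w||^2,
    with f(x) = w.x + b, by batch gradient descent."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(iters):
        margins = y * (X @ w + b)            # y_i f(x_i)
        g = -1.0 / (1.0 + np.exp(margins))   # d/dm of log(1 + exp(-m))
        w -= eta * (X.T @ (g * y) + 2.0 * lam * w)
        b -= eta * np.sum(g * y)             # the bias b is not regularized
    return w, b

# Toy usage: two Gaussian blobs with labels y in {-1, +1}.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (20, 2)), rng.normal(2, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
w, b = train_logistic(X, y)
print("training accuracy:", np.mean(np.sign(X @ w + b) == y))
```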
with the so-called logistic loss
\[
c(\mathbf{x}, y, f(\mathbf{x})) = \log\big(1+\exp(-y f(\mathbf{x}))\big), \qquad (6.27)
\]
and the usual regularizer $\Omega(f) = \|\mathbf{w}\|^2$. Figure 6.4(a) shows the logistic loss. Note its similarity to the hinge loss in Figure 6.3(a).
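To see the similarity numerically, the following snippet (an illustration, not from the book) evaluates the logistic loss of Eq. (6.27) next to the hinge loss $\max(1 - yf(\mathbf{x}), 0)$ at a few sample margins:

```python
import numpy as np

margins = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])   # sample values of yf(x)
logistic = np.log(1.0 + np.exp(-margins))          # logistic loss, Eq. (6.27)
hinge = np.maximum(0.0, 1.0 - margins)             # hinge loss, Figure 6.3(a)
for m, lo, h in zip(margins, logistic, hinge):
    print(f"yf(x) = {m:+.1f}   logistic = {lo:.3f}   hinge = {h:.3f}")
```

Both losses decay as the margin grows; the logistic loss is smooth and strictly positive everywhere, while the hinge loss is exactly zero once $yf(\mathbf{x}) \ge 1$.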
Figure 6.4: (a) The logistic loss $c(\mathbf{x}, y, f(\mathbf{x})) = \log\big(1+\exp(-y f(\mathbf{x}))\big)$ as a function of $y f(\mathbf{x})$. (b) The entropy regularizer, as a function of $f(\mathbf{x})$, that encourages high-confidence classification on unlabeled data.
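The entropy curve in panel (b) can be sketched as follows (an illustration, assuming the logistic model $p(y=1 \mid \mathbf{x}) = 1/(1+\exp(-f(\mathbf{x})))$; the function name is ours):

```python
import math

def entropy_regularizer(fx):
    """Binary entropy of p(y | x), assuming p(y=1 | x) = 1 / (1 + exp(-f(x)))."""
    p = 1.0 / (1.0 + math.exp(-fx))       # predicted p(y = 1 | x)
    p = min(max(p, 1e-12), 1.0 - 1e-12)   # guard against log(0)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))

for fx in (-5.0, -1.0, 0.0, 1.0, 5.0):
    print(f"f(x) = {fx:+.1f}   entropy = {entropy_regularizer(fx):.3f}")
# The entropy peaks at f(x) = 0 (where p = 0.5) and vanishes as |f(x)| grows,
# so minimizing it pushes unlabeled instances away from the decision boundary.
```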
Logistic regression does not use unlabeled data. We can include unlabeled data based on the following intuition: if the two classes are well-separated, then the classification of any unlabeled instance should be confident: it clearly belongs either to the positive class or to the negative class.
Equivalently, the posterior probability $p(y \mid \mathbf{x})$ should be either close to 1 or close to 0. One way