$$P_{-1} = P(Z = 1, T = -1) = P(\mathbf{w}^T\mathbf{x} + w_0 \ge 0,\; T = -1) = q\,(1 - F_{Y|-1}(0)), \tag{4.53}$$

$$P_1 = P(Z = -1, T = 1) = P(\mathbf{w}^T\mathbf{x} + w_0 \le 0,\; T = 1) = p\,F_{Y|1}(0), \tag{4.54}$$
where $F_{Y|t}(0) = P(Y \le 0 \mid T = t)$ is the conditional distribution value at the origin of the univariate r.v. $Y = \mathbf{w}^T\mathbf{x} + w_0$.
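To make (4.53) and (4.54) concrete, here is a minimal numerical sketch assuming hypothetical Gaussian class-conditional inputs with identity covariance and an arbitrary hyperplane (none of these choices come from the text); it compares the closed-form expressions with Monte Carlo estimates of the joint error events.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Hypothetical setup (not from the text): two Gaussian classes in R^2.
q, p = 0.4, 0.6                               # priors of classes -1 and 1
mu_neg, mu_pos = np.array([-1.0, 0.0]), np.array([1.0, 0.5])
w, w0 = np.array([1.0, 0.3]), -0.2            # arbitrary hyperplane parameters

# Y = w^T x + w0 is Gaussian under each class (identity covariance).
s = np.linalg.norm(w)                         # std of Y under either class
F_neg0 = norm.cdf(0, loc=w @ mu_neg + w0, scale=s)   # F_{Y|-1}(0)
F_pos0 = norm.cdf(0, loc=w @ mu_pos + w0, scale=s)   # F_{Y|1}(0)
P_neg = q * (1 - F_neg0)                      # Eq. (4.53)
P_pos = p * F_pos0                            # Eq. (4.54)

# Monte Carlo estimates of the same joint error probabilities.
n = 200_000
t = rng.choice([-1, 1], size=n, p=[q, p])
x = rng.standard_normal((n, 2)) + np.where(t[:, None] < 0, mu_neg, mu_pos)
y = x @ w + w0
print(P_neg, np.mean((y >= 0) & (t == -1)))   # should agree closely
print(P_pos, np.mean((y <= 0) & (t == 1)))    # should agree closely
```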
A direct generalization of Theorem 4.2 can now be stated [212]:
Theorem 4.3. In the two-class multivariate problem, if the optimal set of parameters $\mathbf{w}^* = [w_1\ \ldots\ w_d\ w_0]^T$ of a separating hyperplane constitutes a critical point of the error entropy, then the error probabilities of each class at $\mathbf{w}^*$ are equal.
Proof. We start by noticing that the multivariate classification problem can be viewed as a univariate one using $u = \mathbf{w}^T\mathbf{x}$, the projection of $\mathbf{x}$ onto $\mathbf{w}$.
From an initial input (overall) distribution represented by a density $f_X(\mathbf{x}) = q f_{X|-1}(\mathbf{x}) + p f_{X|1}(\mathbf{x})$ we get, on the projected space, the distribution of the projected data given by $f_U(u) = q f_{U|-1}(u) + p f_{U|1}(u)$. The parameter $w_0$ then works as a Stoller split: a data instance is classified as $\omega_1$ if $u \ge -w_0$ and as $\omega_{-1}$ otherwise. From Theorem 4.1, one can assert that $q f_{U|-1}(u) = p f_{U|1}(u)$ at $\mathbf{w}^*$.
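As a purely illustrative sketch of this crossing condition (hypothetical Gaussian projected densities and priors of our choosing, not taken from the text), the following locates a point where $q f_{U|-1}(u) = p f_{U|1}(u)$ numerically:

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Hypothetical projected class-conditional densities (illustration only):
# f_{U|-1} = N(-1, 1) and f_{U|1} = N(1.5, 1), with priors q = 0.4, p = 0.6.
q, p = 0.4, 0.6
wf_neg = lambda u: q * norm.pdf(u, -1.0, 1.0)   # q f_{U|-1}(u)
wf_pos = lambda u: p * norm.pdf(u, 1.5, 1.0)    # p f_{U|1}(u)

# Find u* with q f_{U|-1}(u*) = p f_{U|1}(u*); u* plays the role of -w0.
u_star = brentq(lambda u: wf_neg(u) - wf_pos(u), -1.0, 1.5)
print(u_star, wf_neg(u_star), wf_pos(u_star))   # weighted densities coincide
```

As becomes explicit once the derivatives (4.56) below are computed, at such a point $\partial P_{-1}/\partial w_0 + \partial P_1/\partial w_0 = 0$, i.e., the total error probability is stationary there.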
We rewrite the error probabilities of each class as

$$P_{-1} = q\,(1 - F_{U|-1}(-w_0)), \qquad P_1 = p\,F_{U|1}(-w_0), \tag{4.55}$$
and compute

$$\frac{\partial P_{-1}}{\partial w_0} = q\,f_{U|-1}(-w_0), \qquad \frac{\partial P_1}{\partial w_0} = -p\,f_{U|1}(-w_0). \tag{4.56}$$
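A quick finite-difference check of (4.56), reusing the same hypothetical projected Gaussian setup as above (again an illustration of ours, not the text's):

```python
import numpy as np
from scipy.stats import norm

q, p = 0.4, 0.6
w0, h = 0.3, 1e-6

# P_{-1}(w0) and P_1(w0) from (4.55), with f_{U|-1} = N(-1,1), f_{U|1} = N(1.5,1).
P_neg = lambda w0: q * (1 - norm.cdf(-w0, -1.0, 1.0))
P_pos = lambda w0: p * norm.cdf(-w0, 1.5, 1.0)

# Central differences should match the analytic derivatives in (4.56).
d_neg = (P_neg(w0 + h) - P_neg(w0 - h)) / (2 * h)
d_pos = (P_pos(w0 + h) - P_pos(w0 - h)) / (2 * h)
print(d_neg,  q * norm.pdf(-w0, -1.0, 1.0))    #  q f_{U|-1}(-w0)
print(d_pos, -p * norm.pdf(-w0, 1.5, 1.0))     # -p f_{U|1}(-w0)
```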
From (4.2),

$$\frac{\partial H_S}{\partial P_t} = \ln\frac{1 - P_{-1} - P_1}{P_t}, \qquad t \in \{-1, 1\},$$

the chain rule and the fact that $q f_{U|-1} = p f_{U|1}$ at $\mathbf{w}^*$ allow writing

$$\frac{\partial H_S}{\partial w_0}(\mathbf{w}^*) = 0 \;\Leftrightarrow \tag{4.57}$$
$$p\,f_{U|1}(-w_0)\left[\ln\frac{1 - P_{-1} - P_1}{P_{-1}} - \ln\frac{1 - P_{-1} - P_1}{P_1}\right] = 0 \;\Leftrightarrow \tag{4.58}$$

$$f_{U|1}(-w_0) = 0 \;\lor\; P_{-1} = P_1. \tag{4.59}$$
Note that $f_{U|1}(-w_0) = 0$ iff the classes have distributions with disjoint supports (they are separable). But in this case $P_{-1} = P_1 = 0$. Thus, in both cases $P_{-1} = P_1$ is a necessary condition.
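A small numerical illustration of the theorem's conclusion, in a deliberately symmetric univariate setting (equal priors and $N(\mp 1, 1)$ projected class densities, chosen by us so that the criticality conditions hold at $w_0 = 0$; the setup is not the text's):

```python
import numpy as np
from scipy.stats import norm

# Symmetric hypothetical setup: q = p = 0.5, f_{U|-1} = N(-1,1), f_{U|1} = N(1,1).
q = p = 0.5

def probs(w0):
    P_neg = q * (1 - norm.cdf(-w0, -1.0, 1.0))   # (4.55)
    P_pos = p * norm.cdf(-w0, 1.0, 1.0)
    return P_neg, P_pos

def H_S(w0):
    # Shannon entropy of the error variable, whose three values have
    # probabilities P_{-1}, P_1 and 1 - P_{-1} - P_1.
    P_neg, P_pos = probs(w0)
    pr = np.array([P_neg, P_pos, 1 - P_neg - P_pos])
    return -np.sum(pr * np.log(pr))

h = 1e-6
print((H_S(h) - H_S(-h)) / (2 * h))  # ~0: w0 = 0 is a critical point of H_S
print(probs(0.0))                    # and the two error probabilities are equal
```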