7.5.1 Local Classification Models and Their Priors
Taking the generative point of view, it is assumed that a single classifier $k$ generates each of the classes with a fixed probability, independent of the input. Thus, its model is, as already introduced in Sect. 4.2.2, given by

$$p(\mathbf{y} \mid \mathbf{w}_k) = \prod_j w_{kj}^{y_j}, \qquad \text{with} \quad \sum_j w_{kj} = 1. \tag{7.112}$$

$\mathbf{w}_k \in \mathbb{R}^{D_Y}$ is the parameter vector of that classifier, with each of its elements $w_{kj}$ modelling the generative probability for its associated class $j$. As a consequence, its elements have to be non-negative and sum up to 1.
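As a concrete illustration, here is a minimal sketch of Eq. (7.112) in Python, assuming the 1-of-$D_Y$ target coding implied by the model (exactly one element of $\mathbf{y}$ is 1, the rest are 0); the function name is illustrative, not from the text.

```python
import numpy as np

def classifier_likelihood(y, w_k):
    """p(y | w_k) = prod_j w_kj^{y_j} for a 1-of-D_Y coded target y.

    Since exactly one element of y is 1 and the rest are 0, the
    product simply picks out the probability of the observed class.
    """
    y, w_k = np.asarray(y), np.asarray(w_k)
    # The parameter vector must be a valid probability distribution.
    assert np.all(w_k >= 0.0) and np.isclose(w_k.sum(), 1.0)
    return np.prod(w_k ** y)

# Example: three classes, observed class is j = 1
w_k = np.array([0.2, 0.5, 0.3])
y = np.array([0, 1, 0])
print(classifier_likelihood(y, w_k))  # 0.5
```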
The conjugate prior $p(\mathbf{w}_k)$ on a classifier's parameters is given by the Dirichlet distribution

$$p(\mathbf{w}_k) = \mathrm{Dir}(\mathbf{w}_k \mid \boldsymbol{\alpha}) = C(\boldsymbol{\alpha}) \prod_j w_{kj}^{\alpha_j - 1}, \tag{7.113}$$
parametrised by the vector $\boldsymbol{\alpha} \in \mathbb{R}^{D_Y}$, which is taken to be the same for all classifiers, due to the lack of better knowledge. Its normalising constant $C(\boldsymbol{\alpha})$ is given by

$$C(\boldsymbol{\alpha}) = \frac{\Gamma(\tilde{\alpha})}{\prod_j \Gamma(\alpha_j)}, \tag{7.114}$$

where $\tilde{\alpha}$ denotes the sum of all elements of $\boldsymbol{\alpha}$, that is,

$$\tilde{\alpha} = \sum_j \alpha_j. \tag{7.115}$$
Given this prior, we have $\mathrm{E}(\mathbf{w}_k) = \boldsymbol{\alpha}/\tilde{\alpha}$, and thus the elements of $\boldsymbol{\alpha}$ allow us to specify a prior bias towards one or the other class. Usually, nothing is known about the class distribution for different areas of the input space, and so all elements of $\boldsymbol{\alpha}$ should be set to the same value.
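The following sketch evaluates Eqs. (7.113)–(7.115) numerically: the normalising constant $C(\boldsymbol{\alpha})$ is computed in log space via scipy's gammaln to avoid overflow, and the prior mean $\mathrm{E}(\mathbf{w}_k) = \boldsymbol{\alpha}/\tilde{\alpha}$ confirms that a symmetric $\boldsymbol{\alpha}$ encodes no class bias. Function names are mine, not the book's.

```python
import numpy as np
from scipy.special import gammaln

def log_dirichlet_norm(alpha):
    """log C(alpha) = log Gamma(alpha_tilde) - sum_j log Gamma(alpha_j),
    with alpha_tilde = sum_j alpha_j (Eqs. 7.114, 7.115)."""
    alpha = np.asarray(alpha)
    return gammaln(alpha.sum()) - gammaln(alpha).sum()

def dirichlet_mean(alpha):
    """E(w_k) = alpha / alpha_tilde under Dir(w_k | alpha)."""
    alpha = np.asarray(alpha)
    return alpha / alpha.sum()

alpha = np.full(3, 1e-2)          # symmetric, weak prior
print(log_dirichlet_norm(alpha))  # log C(alpha)
print(dirichlet_mean(alpha))      # uniform: [1/3, 1/3, 1/3]
```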
In contrast to the relation of the different elements of $\boldsymbol{\alpha}$ to each other, their absolute magnitude specifies the strength of the prior, that is, how strongly the prior affects the posterior in the light of further evidence. Intuitively speaking, a change of 1 in an element of $\boldsymbol{\alpha}$ represents one observation of the associated class. Thus, to keep the prior non-informative, it should be set to small positive values, such as, for example, $\boldsymbol{\alpha} = (10^{-2}, \ldots, 10^{-2})^{\mathrm{T}}$.
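The "one observation per unit of $\boldsymbol{\alpha}$" intuition reflects standard Dirichlet–multinomial conjugacy, which this passage does not spell out: the posterior is again a Dirichlet whose parameters are the prior's plus the observed class counts. A sketch under that standard assumption, with an illustrative function name:

```python
import numpy as np

def dirichlet_posterior(alpha, Y):
    """Posterior Dirichlet parameters after observing 1-of-D_Y coded
    targets Y (one row per observation): alpha + per-class counts.
    Standard conjugate update, assumed rather than quoted from the text."""
    return np.asarray(alpha) + np.asarray(Y).sum(axis=0)

alpha = np.full(3, 1e-2)                  # weak prior from the text
Y = np.array([[0, 1, 0], [0, 1, 0], [1, 0, 0]])
print(dirichlet_posterior(alpha, Y))      # [1.01, 2.01, 0.01]
```

With prior values this small, even a handful of observations dominates the posterior, which is exactly what "non-informative" means here.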
Besides a different classifier model, no further modifications are required to the Bayesian LCS model. Its hidden variables are now $U = \{\mathbf{W}, \mathbf{Z}, \mathbf{V}, \boldsymbol{\beta}\}$, where $\mathbf{W} = \{\mathbf{w}_k\}$ is the set of all classifiers' parameters, whose distribution factorises with respect to $k$, that is,

$$p(\mathbf{W}) = \prod_k p(\mathbf{w}_k). \tag{7.116}$$
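Since Eq. (7.116) factorises over classifiers, the log prior on $\mathbf{W}$ is simply a sum of per-classifier Dirichlet log densities. A minimal sketch using scipy's ready-made Dirichlet density (the helper name is mine):

```python
import numpy as np
from scipy.stats import dirichlet

def log_p_W(W, alpha):
    """log p(W) = sum_k log Dir(w_k | alpha), following Eq. (7.116).
    W is an iterable of per-classifier parameter vectors, all sharing
    the same prior vector alpha as stated in the text."""
    return sum(dirichlet.logpdf(w_k, alpha) for w_k in W)

alpha = np.full(3, 1e-2)
W = [np.array([0.2, 0.5, 0.3]), np.array([0.1, 0.1, 0.8])]
print(log_p_W(W, alpha))
```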
Assuming knowledge of $\mathbf{X}$ and $\mathbf{M}$, the joint distribution of data and hidden variables is given by

$$p(\mathbf{Y}, U \mid \mathbf{X}) = p(\mathbf{Y} \mid \mathbf{X}, \mathbf{W}, \mathbf{Z})\, p(\mathbf{W})\, p(\mathbf{Z} \mid \mathbf{X}, \mathbf{V})\, p(\mathbf{V} \mid \boldsymbol{\beta})\, p(\boldsymbol{\beta}). \tag{7.117}$$
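To make the factorisation of Eq. (7.117) explicit in code, the schematic below assembles the log joint factor by factor. The gating densities $p(\mathbf{Z} \mid \mathbf{X}, \mathbf{V})$, $p(\mathbf{V} \mid \boldsymbol{\beta})$ and $p(\boldsymbol{\beta})$ are defined elsewhere in the chapter and appear here only as placeholder callables; their signatures are assumptions for illustration, not the book's notation.

```python
def log_joint(Y, U, X, log_lik, log_prior_W, log_p_Z, log_p_V, log_p_beta):
    """log p(Y, U | X), mirroring the factorisation of Eq. (7.117).

    U bundles the hidden variables {W, Z, V, beta}; the five callables
    are placeholders for the log densities defined in the chapter.
    """
    W, Z, V, beta = U["W"], U["Z"], U["V"], U["beta"]
    return (log_lik(Y, X, W, Z)       # log p(Y | X, W, Z)
            + log_prior_W(W)          # log p(W), Eq. (7.116)
            + log_p_Z(Z, X, V)        # log p(Z | X, V)
            + log_p_V(V, beta)        # log p(V | beta)
            + log_p_beta(beta))       # log p(beta)
```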