7.5.1 Local Classification Models and Their Priors
Taking the generative point of view, it is assumed that a single classifier k generates each of the classes with a fixed probability, independent of the input. Thus, its model is, as already introduced in Sect. 4.2.2, given by

    p(y \,|\, w_k) = \prod_j w_{kj}^{y_j}, \qquad \text{with } \sum_j w_{kj} = 1. \qquad (7.112)

w_k ∈ R^{D_Y} is the parameter vector of that classifier, with each of its elements w_kj modelling the generative probability for its associated class j. As a consequence, its elements have to be non-negative and sum up to 1.
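As a minimal illustration of this model, the following Python sketch evaluates p(y | w_k) for a single observation, assuming y is given in 1-of-D_Y coding (one element set to 1, all others to 0); the function name and the numerical values are purely illustrative.

```python
import numpy as np

def classifier_log_likelihood(y, w_k):
    """Log of p(y | w_k) = prod_j w_kj^{y_j} for one observation (Eq. 7.112).

    Assumes y is a 1-of-D_Y coded class vector (one element 1, the rest 0)
    and w_k is a probability vector over the D_Y classes.
    """
    y = np.asarray(y, dtype=float)
    w_k = np.asarray(w_k, dtype=float)
    return float(np.sum(y * np.log(w_k)))

# Illustrative values: three classes with generative probabilities 0.7, 0.2, 0.1.
w_k = np.array([0.7, 0.2, 0.1])
y = np.array([1.0, 0.0, 0.0])                       # observation of class 1
print(np.exp(classifier_log_likelihood(y, w_k)))    # -> 0.7
```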
The conjugate prior p(w_k) on a classifier's parameters is given by the Dirichlet distribution

    p(w_k) = \mathrm{Dir}(w_k \,|\, \alpha) = C(\alpha) \prod_j w_{kj}^{\alpha_j - 1}, \qquad (7.113)

parametrised by the vector α ∈ R^{D_Y}, which is the same for all classifiers, due to the lack of better knowledge. Its normalising constant C(α) is given by

    C(\alpha) = \frac{\Gamma(\tilde{\alpha})}{\prod_j \Gamma(\alpha_j)}, \qquad (7.114)

where α̃ denotes the sum of all elements of α, that is,

    \tilde{\alpha} = \sum_j \alpha_j. \qquad (7.115)
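The normalising constant and the resulting Dirichlet density are straightforward to evaluate numerically. The sketch below does so in log space, following Eqs. (7.113)–(7.115); the function names are illustrative and not part of the model, and the final lines merely cross-check the result against SciPy's own Dirichlet density.

```python
import numpy as np
from scipy.special import gammaln

def log_C(alpha):
    """log C(alpha) = log Gamma(alpha_tilde) - sum_j log Gamma(alpha_j),
    with alpha_tilde = sum_j alpha_j (Eqs. 7.114 and 7.115)."""
    alpha = np.asarray(alpha, dtype=float)
    return gammaln(alpha.sum()) - gammaln(alpha).sum()

def log_dirichlet(w_k, alpha):
    """log Dir(w_k | alpha) = log C(alpha) + sum_j (alpha_j - 1) log w_kj (Eq. 7.113)."""
    w_k = np.asarray(w_k, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return log_C(alpha) + np.sum((alpha - 1.0) * np.log(w_k))

# Illustrative check against SciPy's Dirichlet log-density.
from scipy.stats import dirichlet
alpha = np.array([0.5, 0.5, 0.5])
w_k = np.array([0.6, 0.3, 0.1])
print(log_dirichlet(w_k, alpha), dirichlet.logpdf(w_k, alpha))
```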
Given this prior, we have E(w_k) = α/α̃, and thus the elements of α allow us to specify a prior bias towards one or the other class. Usually, nothing is known about the class distribution for different areas of the input space, and so all elements of α should be set to the same value.

In contrast to the relation of the different elements of α to each other, their absolute magnitude specifies the strength of the prior, that is, how strongly the prior affects the posterior in the light of further evidence. Intuitively speaking, a change of 1 to an element of α represents one observation of the associated class. Thus, to keep the prior non-informative, its elements should be set to small positive values, such as, for example, α = (10^{-2}, \ldots, 10^{-2})^T.
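This pseudo-count reading of α can be made concrete with the standard Dirichlet–multinomial conjugate update, sketched below. Note that the sketch simply adds raw class counts to α and ignores the weighting by classifier matching and responsibilities that the full LCS model update introduces; the counts used are hypothetical.

```python
import numpy as np

def dirichlet_mean(alpha):
    """E(w_k) = alpha / alpha_tilde for a Dirichlet with parameter vector alpha."""
    alpha = np.asarray(alpha, dtype=float)
    return alpha / alpha.sum()

# Non-informative prior: all elements equal and small.
alpha_prior = np.full(3, 1e-2)

# Observing a class adds 1 to the corresponding element of alpha
# (conjugacy of the Dirichlet with the multinomial classifier model).
class_counts = np.array([8, 1, 1])           # hypothetical observed class counts
alpha_post = alpha_prior + class_counts

print(dirichlet_mean(alpha_prior))   # uniform over classes before any data
print(dirichlet_mean(alpha_post))    # dominated by the observed counts
```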
Besides a different classifier model, no further modifications are required to the Bayesian LCS model. Its hidden variables are now U = {W, Z, V, β}, where W = {w_k} is the set of the classifiers' parameters, whose distribution factorises with respect to k, that is,

    p(W) = \prod_k p(w_k). \qquad (7.116)
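Since the prior factorises over classifiers, its log density is simply a sum of per-classifier Dirichlet terms. The following sketch assumes, as stated above, that every classifier shares the same α; the names and values are illustrative.

```python
import numpy as np
from scipy.special import gammaln

def log_dir(w, alpha):
    """log Dir(w | alpha) as in Eq. (7.113)."""
    return gammaln(alpha.sum()) - gammaln(alpha).sum() + np.sum((alpha - 1.0) * np.log(w))

def log_prior_W(W, alpha):
    """log p(W) = sum_k log Dir(w_k | alpha): the prior factorises over k (Eq. 7.116),
    with the same alpha shared by all classifiers."""
    return sum(log_dir(w_k, alpha) for w_k in W)

# Illustrative example: two classifiers over three classes, shared prior alpha.
alpha = np.full(3, 1e-2)
W = [np.array([0.7, 0.2, 0.1]), np.array([0.2, 0.5, 0.3])]
print(log_prior_W(W, alpha))
```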
Assuming knowledge of X and \mathcal{M}, the joint distribution of data and hidden variables is given by

    p(Y, U \,|\, X) = p(Y \,|\, X, W, Z)\, p(W)\, p(Z \,|\, X, V)\, p(V \,|\, \beta)\, p(\beta). \qquad (7.117)
 