where $\mathcal{N}$ stands for a Gaussian, and the model parameters $\theta_k = \{w_k, \tau_k\}$ are the $D_X$-dimensional weight vector $w_k$ and the noise precision (that is, inverse variance) $\tau_k$. The distribution is centred on the inner product $w_k^T x$, and its spread is inversely proportional to $\tau_k$ and independent of the input.

As we give a detailed discussion of the implications of assuming this expert model and various forms of its incremental training in Chap. 5, let us here only consider how it specifies the maximisation step of the EM-algorithm for training the MoE model, in particular with respect to the weight vector $w_k$: combining (4.14) and (4.15), the term to maximise becomes

$$
\sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln p(y_n \mid x_n, w_k, \tau_k)
= \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \left( \frac{1}{2} \ln \tau_k - \frac{\tau_k}{2} \left( w_k^T x_n - y_n \right)^2 - \frac{1}{2} \ln 2\pi \right)
= - \sum_{k=1}^{K} \frac{\tau_k}{2} \sum_{n=1}^{N} r_{nk} \left( w_k^T x_n - y_n \right)^2 + \mathrm{const.},
$$
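To make the structure of this objective concrete, here is a minimal sketch in NumPy that evaluates the responsibility-weighted Gaussian log-likelihood above for all experts at once; the function name, argument layout, and array shapes are illustrative assumptions, not part of the text.

```python
import numpy as np

def weighted_gaussian_loglik(X, y, W, tau, R):
    """Evaluate sum_n sum_k r_nk ln N(y_n | w_k^T x_n, 1/tau_k).

    X   : (N, D_X) inputs, rows x_n^T       (shapes are assumptions)
    y   : (N,)     scalar outputs y_n
    W   : (K, D_X) expert weight vectors w_k as rows
    tau : (K,)     noise precisions tau_k
    R   : (N, K)   responsibilities r_nk
    """
    sq_err = (X @ W.T - y[:, None]) ** 2            # (N, K): (w_k^T x_n - y_n)^2
    loglik = 0.5 * np.log(tau) - 0.5 * tau * sq_err - 0.5 * np.log(2 * np.pi)
    return float(np.sum(R * loglik))                # weight each term by r_nk
```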
where the constant term absorbs all terms that are independent of the weight vectors. Considering the experts separately, the aim for expert $k$ is to find

$$
\min_{w_k} \sum_{n=1}^{N} r_{nk} \left( w_k^T x_n - y_n \right)^2, \tag{4.16}
$$

which is a weighted linear least squares problem. This shows how the assumption of Gaussian noise locally leads to minimising the empirical risk with the $L_2$ loss function.
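Since (4.16) is a weighted linear least squares problem, it admits a closed-form solution. A minimal sketch, assuming NumPy (the function name and shapes are illustrative): scaling row $n$ of $X$ and the output $y_n$ by $\sqrt{r_{nk}}$ reduces the weighted problem to ordinary least squares.

```python
import numpy as np

def m_step_expert_weights(X, y, r_k):
    """Solve (4.16): min_w sum_n r_nk (w^T x_n - y_n)^2 for one expert.

    Scaling row n of X and y by sqrt(r_nk) turns the weighted problem
    into an ordinary least squares problem, solved via np.linalg.lstsq.
    """
    s = np.sqrt(r_k)                                  # (N,) row weights sqrt(r_nk)
    w_k, *_ = np.linalg.lstsq(s[:, None] * X, s * y, rcond=None)
    return w_k                                        # (D_X,) new weight vector
```

Equivalently, $w_k$ solves the weighted normal equations $(X^T R_k X)\, w_k = X^T R_k\, y$ with $R_k = \operatorname{diag}(r_{1k}, \ldots, r_{Nk})$; the square-root rescaling avoids forming these matrix products explicitly and is numerically better behaved.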
4.2.2 Experts for Classification
For classification, assume that the number of classes is $D_Y$, and the outputs are the vectors $y = (y_1, \ldots, y_{D_Y})^T$ with all elements equal to 0 except for the $j$th element, $y_j = 1$, where $j$ is the class associated with this output vector. Thus, similarly to the latent vector $z$, the different $y$'s obey a 1-of-$D_Y$ structure.
The expert model $p(y \mid x, \theta_k)$ gives the probability of the expert having generated an observation of the class specified by $y$. Analogous to the gating network (4.4), this model could assume a log-linear relationship between this probability and the input $x$, which implies that $p(y \mid x, \theta_k)$ is assumed to vary with $x$. However, to simplify interpretation of the expert model, it will be assumed that this probability remains constant over all inputs that the expert is responsible for, that is,

$$
p(y \mid x, w_k) = \prod_{j=1}^{D_Y} w_{kj}^{y_j}, \qquad \text{with} \quad \sum_{j=1}^{D_Y} w_{kj} = 1. \tag{4.17}
$$

Thus, $p(y \mid x, w_k)$ is independent of the input $x$ and parametrised by $\theta_k = w_k$, and for any given $y$ representing class $j$, the model's probability is given by $w_{kj}$, the $j$th element of $w_k$.
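As a small illustration of (4.17), the following hypothetical sketch (function and variable names are assumptions) shows that for a 1-of-$D_Y$ coded $y$ the product collapses to the single component of $w_k$ at the active class:

```python
import numpy as np

def expert_class_prob(y, w_k):
    """p(y | x, w_k) = prod_j w_kj^{y_j} from (4.17), independent of x.

    For 1-of-D_Y coded y, every factor with y_j = 0 equals 1, so the
    product reduces to the element of w_k at the active class.
    """
    return float(np.prod(w_k ** y))

w_k = np.array([0.7, 0.2, 0.1])    # expert's class probabilities, sum to 1
y = np.array([0, 1, 0])            # 1-of-3 coding for the second class
print(expert_class_prob(y, w_k))   # 0.2, i.e. the second element of w_k
```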
 