Firstly, the standard MoE model [121] is introduced, and its training and
expert localisation are discussed. This is followed in Sect. 4.2 by a discussion of
expert models for both regression and classification. To relate MoE to LCS,
the MoE model is generalised in Sect. 4.3, together with a description of how its
training has to be modified to accommodate these generalisations. Identifying
difficulties with the latter, a modified training scheme is introduced in Sect. 4.4
that makes the introduced model more similar to XCS.
4.1 The Mixtures-of-Experts Model
The MoE model is probably best explained from the generative point-of-view:
given a set of K experts, each observation in the training set is assumed to be
generated by one and only one of these experts. Let $\mathbf{z} = (z_1, \ldots, z_K)^T$ be a
random binary vector, where each of its elements $z_k$ is associated with an expert
and indicates whether that expert generated the given observation $(\mathbf{x}, \mathbf{y})$. Given
that expert $k$ generated the observation, then $z_j = 1$ for $j = k$, and $z_j = 0$
otherwise, resulting in a 1-of-$K$ structure of $\mathbf{z}$. The introduced random vector
is a latent variable, as its values cannot be directly observed. Each observation
$(\mathbf{x}_n, \mathbf{y}_n)$ in the training set has such a random vector $\mathbf{z}_n$ associated with it,
and $\mathbf{Z} = \{\mathbf{z}_n\}$ denotes the set of latent variables corresponding to each of the
observations in the training set.

Each expert provides a probabilistic mapping $\mathcal{X} \to \mathcal{Y}$ that is given by the
conditional probability density $p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}_k)$, that is, the probability of the output
being vector $\mathbf{y}$, given the input vector $\mathbf{x}$ and the model parameters $\boldsymbol{\theta}_k$ of expert $k$.
Depending on whether we deal with regression or classification tasks, experts can
represent different parametric models. The expert models are left unspecified for
now; linear regression and classification models will be introduced in Sect. 4.2.
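To make the generative view concrete, the following is a minimal Python sketch of this process. It is not part of the model specification in [121]: it assumes scalar outputs, hypothetical Gaussian linear-regression experts (such expert models are only introduced in Sect. 4.2), and a fixed, input-independent distribution g over experts as a stand-in for the expert-selection mechanism discussed later. All identifiers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D_x, N = 3, 2, 100              # number of experts, input dimension, observations
W = rng.normal(size=(K, D_x))      # hypothetical expert parameters theta_k
sigma2 = 0.1                       # hypothetical expert noise variance
g = np.array([0.5, 0.3, 0.2])      # hypothetical fixed probability of each expert

def generate_observation(x):
    """Pick exactly one expert at random, then let that expert generate y."""
    k = rng.choice(K, p=g)                     # expert responsible for this observation
    z = np.zeros(K)
    z[k] = 1.0                                 # 1-of-K latent vector z_n
    y = W[k] @ x + rng.normal(scale=np.sqrt(sigma2))   # expert k's noisy output
    return y, z

X = rng.normal(size=(N, D_x))                  # inputs x_n
Y, Z = map(np.array, zip(*(generate_observation(x) for x in X)))
```

Each observation is thus generated by one and only one expert, and the matrix Z collects the associated latent 1-of-K vectors.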
4.1.1 Likelihood for Known Gating
A common approach to training probabilistic models is to maximise the likelihood
of the outputs given the inputs and the model parameters, a principle known as
maximum likelihood. As will be shown later, maximum likelihood training is
equivalent to minimising the empirical risk, with a loss function depending on the
probabilistic formulation of the model.
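As a brief preview of that equivalence (the full argument follows later), consider a single expert with Gaussian noise of fixed variance $\sigma^2$ around a prediction $f_{\boldsymbol{\theta}}(\mathbf{x})$; this is an illustrative assumption, not part of the model as defined so far. The negative log-likelihood of one observation is then
\[
-\ln p(\mathbf{y} \mid \mathbf{x}, \boldsymbol{\theta}) = \frac{1}{2\sigma^2}\,\bigl\| \mathbf{y} - f_{\boldsymbol{\theta}}(\mathbf{x}) \bigr\|^2 + \mathrm{const},
\]
so that maximising the likelihood over $\boldsymbol{\theta}$ amounts to minimising the summed squared error, that is, the empirical risk under the squared-error loss.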
Following the standard assumptions of independent observations, and additionally
assuming knowledge of the values of the latent variables $\mathbf{Z}$, the likelihood
of the training set is given by
\[
p(\mathbf{Y} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\theta}) = \prod_{n=1}^{N} p(\mathbf{y}_n \mid \mathbf{x}_n, \mathbf{z}_n, \boldsymbol{\theta}), \qquad (4.1)
\]
where $\boldsymbol{\theta}$ stands for the model parameters. Due to the 1-of-$K$ structure of each
$\mathbf{z}_n$, the likelihood for the $n$th observation is given by
\[
p(\mathbf{y}_n \mid \mathbf{x}_n, \mathbf{z}_n, \boldsymbol{\theta}) = \prod_{k=1}^{K} p(\mathbf{y}_n \mid \mathbf{x}_n, \boldsymbol{\theta}_k)^{z_{nk}}, \qquad (4.2)
\]
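As an illustration of (4.1) and (4.2), the sketch below evaluates the complete-data log-likelihood $\ln p(\mathbf{Y} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\theta}) = \sum_n \sum_k z_{nk} \ln p(\mathbf{y}_n \mid \mathbf{x}_n, \boldsymbol{\theta}_k)$ for the hypothetical Gaussian linear-regression experts of the earlier generative sketch; the function name and the Gaussian expert densities are assumptions for illustration, not part of the model as specified so far.

```python
import numpy as np
from scipy.stats import norm

def complete_data_log_likelihood(X, Y, Z, W, sigma2):
    """ln p(Y | X, Z, theta) = sum_n sum_k z_nk * ln p(y_n | x_n, theta_k)."""
    means = X @ W.T                                         # (N, K): expert k's prediction for x_n
    log_p = norm.logpdf(Y[:, None], loc=means, scale=np.sqrt(sigma2))   # (N, K) log-densities
    # The 1-of-K structure of z_n picks out the responsible expert's log-density,
    # which is exactly the logarithm of the products in (4.1) and (4.2).
    return float(np.sum(Z * log_p))
```

With the arrays X, Y, Z, W and sigma2 from the generative sketch above, complete_data_log_likelihood(X, Y, Z, W, sigma2) returns the logarithm of the likelihood (4.1).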