where $z_{nk}$ is the $k$th element of $\mathbf{z}_n$. As only one element of $\mathbf{z}_n$ can be 1, the above expression is equivalent to the $j$th expert model such that $z_{nj} = 1$.
As the logarithm function is monotonically increasing, maximising the logarithm of the likelihood is equivalent to maximising the likelihood. Combining (4.1) and (4.2), the log-likelihood $\ln p(\mathbf{Y} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\theta})$ results in
$$\ln p(\mathbf{Y} \mid \mathbf{X}, \mathbf{Z}, \boldsymbol{\theta}) = \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln p(y_n \mid \mathbf{x}_n, \boldsymbol{\theta}_k). \qquad (4.3)$$
Inspecting (4.3), we can see that each observation $n$ is assigned to the single expert for which $z_{nk} = 1$. Hence, it is maximised by maximising the likelihood of the expert models separately, for each expert based on its assigned set of observations.
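The decomposition implied by (4.3) can be checked numerically. The following is a minimal sketch with toy data; the linear-Gaussian experts, the variance of 0.01, and all variable names are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: N observations, K linear-Gaussian experts.
N, K, D = 8, 3, 2
X = rng.normal(size=(N, D))
theta = rng.normal(size=(K, D))          # one weight vector per expert
assign = rng.integers(0, K, size=N)      # index of the generating expert
Z = np.eye(K)[assign]                    # 1-of-K latent indicators z_nk
y = np.einsum('nd,nd->n', X, theta[assign]) + 0.1 * rng.normal(size=N)

def log_gauss(y, mean, var=0.01):
    # Log-density of a univariate Gaussian (assumed expert model).
    return -0.5 * (np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

# Complete-data log-likelihood, eq. (4.3):
# sum over n and k of z_nk * ln p(y_n | x_n, theta_k).
log_p = log_gauss(y[:, None], X @ theta.T)   # shape (N, K)
ll = np.sum(Z * log_p)

# Equivalent: sum over experts of the likelihood of their assigned points.
ll_per_expert = sum(log_p[assign == k, k].sum() for k in range(K))
assert np.isclose(ll, ll_per_expert)
```

Because each $\mathbf{z}_n$ selects exactly one expert, the double sum and the expert-wise grouping give the same value, which is why each expert can be trained on its assigned observations alone.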
4.1.2 Parametric Gating Network
As the latent variables $\mathbf{Z}$ are not directly observable, we do not know the values that they take and therefore cannot maximise the likelihood introduced in the previous section directly. Rather, a parametric model for $\mathbf{Z}$, known as the gating network, is used instead and trained in combination with the experts.
The gating network used in the standard MoE model is based on the assumption that the probability of an expert having generated the observation $(\mathbf{x}, y)$ is log-linearly related to the input $\mathbf{x}$. This is formulated by

$$g_k(\mathbf{x}) \equiv p(z_k = 1 \mid \mathbf{x}, \mathbf{v}_k) \propto \exp(\mathbf{v}_k^{\mathsf{T}} \mathbf{x}), \qquad (4.4)$$

stating that the probability of expert $k$ having generated observation $(\mathbf{x}, y)$ is proportional to the exponential of the inner product of the input $\mathbf{x}$ and the gating vector $\mathbf{v}_k$ of the same size as $\mathbf{x}$. Normalising $p(z_k = 1 \mid \mathbf{x}, \mathbf{v}_k)$, we get

$$g_k(\mathbf{x}) \equiv p(z_k = 1 \mid \mathbf{x}, \mathbf{v}_k) = \frac{\exp(\mathbf{v}_k^{\mathsf{T}} \mathbf{x})}{\sum_{j=1}^{K} \exp(\mathbf{v}_j^{\mathsf{T}} \mathbf{x})}, \qquad (4.5)$$
which is the well-known softmax function, and corresponds to the multinomial logit model in Statistics that is often used to model consumer choice [165]. It is parametrised by one gating vector $\mathbf{v}_k$ per expert, in combination forming the set $\mathbf{V} = \{\mathbf{v}_k\}$. Fig. 4.1 shows the directed graphical model that illustrates the structure and variable dependencies of the Mixtures-of-Experts model.
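The softmax gating of (4.5) can be sketched as follows. The max-subtraction trick for numerical stability is an implementation detail not discussed in the text; the gating vectors chosen here are arbitrary illustrative values.

```python
import numpy as np

def gating(x, V):
    """Softmax gating, eq. (4.5): g_k(x) = exp(v_k' x) / sum_j exp(v_j' x).

    V is a (K, D) matrix stacking the gating vectors v_k. Subtracting the
    maximum before exponentiating avoids overflow without changing the
    result, since the constant factor cancels in the normalisation.
    """
    a = V @ x
    a = a - a.max()
    e = np.exp(a)
    return e / e.sum()

# Usage: three experts, two-dimensional input (values are illustrative).
V = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
g = gating(np.array([0.5, -0.5]), V)
assert np.isclose(g.sum(), 1.0) and np.all(g > 0)
```

By construction the outputs are positive and sum to one, so $g_k(\mathbf{x})$ is a valid distribution over the $K$ experts for every input.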
To get the log-likelihood $l(\boldsymbol{\theta}; \mathcal{D}) \equiv \ln p(\mathbf{Y} \mid \mathbf{X}, \boldsymbol{\theta})$, we use the 1-of-$K$ structure of $\mathbf{z}$ to express the probability of having a latent random vector $\mathbf{z}$ for a given input $\mathbf{x}$ and a set of gating parameters $\mathbf{V}$ by

$$p(\mathbf{z} \mid \mathbf{x}, \mathbf{V}) = \prod_{k=1}^{K} p(z_k = 1 \mid \mathbf{x}, \mathbf{v}_k)^{z_k} = \prod_{k=1}^{K} g_k(\mathbf{x})^{z_k}. \qquad (4.6)$$
Thus, by combining (4.2) and (4.6), the joint density over $y$ and $\mathbf{z}$ is given by

$$p(y, \mathbf{z} \mid \mathbf{x}, \boldsymbol{\theta}) = \prod_{k=1}^{K} g_k(\mathbf{x})^{z_k} \, p(y \mid \mathbf{x}, \boldsymbol{\theta}_k)^{z_k}. \qquad (4.7)$$