The Optimal Set of Classifiers - Design and Analysis of Learning Classifier Systems


when maximising with respect to U_i, where the expectation is taken with respect to all hidden variables except for U_i, and the constant term is the logarithm of the normalisation constant of q_i [19, 118]. In our case, we group the variables according to their priors by

{W, τ}, {α}, {V}, {β}, {Z}.

Handling the Softmax Function

If the model has a conjugate-exponential structure, (7.24) gives an analytical solution with a distribution form equal to the prior of the corresponding hidden variable. However, in our case the generalised softmax function (7.10) does not conform to this conjugate-exponential structure, and needs to be dealt with separately. A possible approach is to replace the softmax function by an exponential lower bound on it, which consequently introduces additional variational variables with respect to which L(q) also needs to be maximised. This approach was followed by Bishop and Svensén [20] and Jaakkola and Jordan [119] for the logistic sigmoid function, but currently there is no known exponential lower bound function on the softmax besides a conjectured one by Gibbs [93] (see footnote 4). Alternatively, we can follow the approach taken by Waterhouse et al. [227, 226], where q_V(V) is approximated by a Laplace approximation. Due to the lack of better alternatives, this approach is chosen, despite such an approximation invalidating the lower bound nature of L(q).
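As a rough illustration of what a Laplace approximation does, the following Python sketch fits a Gaussian to a posterior over weights by locating the posterior mode and inverting the Hessian of the negative log-posterior at that point. It is not the derivation used in the text: for brevity it assumes a two-class logistic model with an isotropic Gaussian prior of precision beta as a stand-in for the generalised softmax, and all function and variable names are hypothetical.

```python
import numpy as np

def neg_log_posterior(v, X, t, beta):
    # sum_n [ln(1 + exp(a_n)) - t_n a_n] + (beta / 2) ||v||^2, up to a constant
    a = X @ v
    return np.sum(np.logaddexp(0.0, a) - t * a) + 0.5 * beta * (v @ v)

def grad_and_hessian(v, X, t, beta):
    g = 0.5 * (1.0 + np.tanh(0.5 * (X @ v)))  # logistic sigmoid, overflow-safe
    grad = X.T @ (g - t) + beta * v
    hess = X.T @ (X * (g * (1.0 - g))[:, None]) + beta * np.eye(X.shape[1])
    return grad, hess

def laplace_approximation(X, t, beta, max_iter=100, tol=1e-8):
    """Return mean and covariance of a Gaussian centred on the posterior mode."""
    v = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad, hess = grad_and_hessian(v, X, t, beta)
        if np.linalg.norm(grad) < tol:
            break
        step = np.linalg.solve(hess, grad)  # Newton direction
        f0 = neg_log_posterior(v, X, t, beta)
        s = 1.0
        # Backtracking line search so that each step does not increase the objective.
        while neg_log_posterior(v - s * step, X, t, beta) > f0 - 1e-4 * s * (grad @ step):
            s *= 0.5
        v = v - s * step
    # The covariance is the inverse Hessian of the negative log-posterior at the mode.
    _, hess = grad_and_hessian(v, X, t, beta)
    return v, np.linalg.inv(hess)

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
t = (X[:, 0] + 0.5 * rng.normal(size=20) > 0).astype(float)
v_map, cov = laplace_approximation(X, t, beta=1.0)
```

The Gaussian N(v | v_map, cov) then takes the place of the exact posterior. As the text notes, nothing guarantees that this substitution preserves a lower bound on the evidence.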

Update Equations and Model Posterior

To get the update equations for the parameters of the variational distribution, we need to evaluate (7.24) for each group of hidden variables in U separately, similar to the derivations by Waterhouse et al. [226] and Ueda and Ghahramani [216]. This provides us with an approximation for the posterior p(U | Y); the derivations are given in the following sections.

Approximating the model evidence p(Y) requires a closed-form expression for L(q), obtained by evaluating (7.21). Many terms of the variational update equations can be reused for this, as will be shown after the update equations have been derived.

7.3.2 Classifier Model q*_{W,τ}(W, τ)

The maximum of L(q) with respect to W and τ is given by evaluating (7.24) for q_{W,τ}, which, by using (7.15), (7.16) and (7.6), results in

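\begin{align}
\ln q^*_{W,\tau}(W, \tau) &= \mathbb{E}_Z\big(\ln p(Y \mid W, \tau, Z)\big) + \mathbb{E}_\alpha\big(\ln p(W, \tau \mid \alpha)\big) + \text{const.} \nonumber \\
&= \sum_n \sum_k \mathbb{E}_Z\big(z_{nk} \ln p(y_n \mid W_k, \tau_k)\big) + \sum_k \mathbb{E}_\alpha\big(\ln p(W_k, \tau_k \mid \alpha_k)\big) + \text{const.}, \tag{7.25}
\end{align}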
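The first term of (7.25) can be made concrete with a small Python sketch. Since z_nk enters linearly, its expectation can be pulled out as r_nk = E_Z(z_nk), leaving a responsibility-weighted sum of per-classifier log-likelihoods. The sketch assumes a univariate Gaussian classifier model, p(y_n | w_k, τ_k) = N(y_n | w_k^T x_n, 1/τ_k), which may differ from the form given in (7.6); the names R, W and tau are illustrative only.

```python
import numpy as np

def expected_log_likelihood(X, y, W, tau, R):
    """Compute sum_n sum_k r_nk * ln N(y_n | w_k . x_n, 1 / tau_k).

    X: (N, D) inputs, y: (N,) outputs, W: (K, D) weight vectors,
    tau: (K,) noise precisions, R: (N, K) responsibilities r_nk = E_Z(z_nk).
    """
    pred = X @ W.T                                  # (N, K) per-classifier predictions
    sq_resid = (y[:, None] - pred) ** 2             # (N, K) squared residuals
    log_lik = 0.5 * np.log(tau / (2.0 * np.pi)) - 0.5 * tau * sq_resid
    return np.sum(R * log_lik)

# Toy values, for illustration only.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
W = np.array([[0.0, 1.0], [1.0, 0.0]])
tau = np.array([1.0, 4.0])
R = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
value = expected_log_likelihood(X, y, W, tau, R)
```

Each classifier k only "sees" observation n to the degree r_nk, which is why the resulting update for q*_{W,τ} factorises over classifiers.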

4 A more general bound was recently developed by Wainwright, Jaakkola and Willsky [225], but its applicability still needs to be evaluated.
