expression. Hence, we will make use of an approximation technique known as
variational Bayesian inference [119, 19] that provides us with such a closed-form
expression.
Alternatively, sampling techniques, such as Markov Chain Monte Carlo
(MCMC) methods, could be utilised to get an accurate posterior and model
evidence. However, the model structure search is expensive and requires a quick
evaluation of the model evidence for any given model structure; the
computational burden of sampling techniques therefore makes approximating the
model evidence by variational methods the better choice.
For the remainder of this chapter, all distributions are treated as being im-
plicitly conditional on X and M, to keep the notation simple. Additionally, the
ranges of sums and products will not always be specified explicitly, as they are
usually obvious from their context.
7.3.1 Variational Bayesian Inference
The aim of Bayesian inference and model selection is, on one hand, to find
a variational distribution q(U) that approximates the true posterior p(U|Y)
and, on the other hand, to get the model evidence p(Y). Variational Bayesian
inference is based on the decomposition [19, 118]

    ln p(Y) = L(q) + KL(q‖p),    (7.20)

    L(q) = ∫ q(U) ln ( p(U, Y) / q(U) ) dU,    (7.21)

    KL(q‖p) = − ∫ q(U) ln ( p(U|Y) / q(U) ) dU,    (7.22)
which holds for any choice of q. As the Kullback-Leibler divergence KL(q‖p) is
always non-negative, and zero if and only if p(U|Y) = q(U) [232], the variational
bound L(q) is a lower bound on ln p(Y) and equal to the latter only if
q(U) is the true posterior p(U|Y). Hence, the posterior can be approximated
by maximising the lower bound L(q), which brings the variational distribution
closer to the true posterior and at the same time yields an approximation of the
model evidence by L(q) ≤ ln p(Y).
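The decomposition (7.20) can be checked numerically on a toy model with a single binary hidden variable, where the integrals reduce to sums. The joint probabilities and the variational distribution below are arbitrary illustrative numbers, not taken from the text:

```python
import math

# Joint p(U, Y) for a binary hidden variable U, with Y already observed.
# The numbers are an arbitrary toy example.
p_joint = {0: 0.3, 1: 0.1}                         # p(U = u, Y)
p_Y = sum(p_joint.values())                        # model evidence p(Y)
p_post = {u: p_joint[u] / p_Y for u in p_joint}    # true posterior p(U | Y)

# An arbitrary variational distribution q(U), not equal to the posterior
q = {0: 0.6, 1: 0.4}

# Variational bound L(q), Eq. (7.21), with the integral becoming a sum
L = sum(q[u] * math.log(p_joint[u] / q[u]) for u in q)

# Kullback-Leibler divergence KL(q || p), Eq. (7.22)
KL = sum(q[u] * math.log(q[u] / p_post[u]) for u in q)

print(L + KL, math.log(p_Y))   # the two sides of Eq. (7.20) agree
print(L <= math.log(p_Y))      # L(q) is indeed a lower bound on ln p(Y)
```

Setting q equal to p_post makes KL vanish and the bound tight, in line with the equality condition stated above.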
Factorial Distributions
To make this approach tractable, we need to choose a family of distributions
q(U) that gives an analytical solution. A frequently used approach (for example,
[20, 227]) that is sufficiently flexible to give a good approximation to the true
posterior is to use the set of distributions that factorises with respect to disjoint
groups U_i of variables

    q(U) = ∏_i q_i(U_i),    (7.23)

which allows maximising L(q) with respect to each group of hidden variables
separately while keeping the others fixed. This results in

    ln q_i(U_i) = E_{i≠j}(ln p(U, Y)) + const.,    (7.24)
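The update (7.24) can be sketched as coordinate ascent on a discrete toy model with two binary hidden variables, where the expectation over all other groups is a weighted sum and the additive constant is absorbed by normalisation. The joint table below is arbitrary and only for illustration:

```python
import math

# Toy joint p(U1, U2, Y) with Y observed: an arbitrary 2x2 table of
# positive values, chosen only for illustration.
p = {(0, 0): 0.20, (0, 1): 0.05, (1, 0): 0.10, (1, 1): 0.30}

def bound(q1, q2):
    """Variational bound L(q) for the factorised q(U) = q1(U1) q2(U2)."""
    return sum(q1[a] * q2[b] * math.log(p[a, b] / (q1[a] * q2[b]))
               for a, b in p)

# Start from uniform factors
q1 = {0: 0.5, 1: 0.5}
q2 = {0: 0.5, 1: 0.5}

for _ in range(20):
    # Eq. (7.24): ln q1(U1) = E_{q2}(ln p(U, Y)) + const., then normalise
    log_q1 = {a: sum(q2[b] * math.log(p[a, b]) for b in (0, 1))
              for a in (0, 1)}
    z = sum(math.exp(v) for v in log_q1.values())
    q1 = {a: math.exp(v) / z for a, v in log_q1.items()}
    # The same update for q2, holding q1 fixed
    log_q2 = {b: sum(q1[a] * math.log(p[a, b]) for a in (0, 1))
              for b in (0, 1)}
    z = sum(math.exp(v) for v in log_q2.values())
    q2 = {b: math.exp(v) / z for b, v in log_q2.items()}

print(bound(q1, q2) <= math.log(sum(p.values())))  # still a lower bound
```

Each update can only increase L(q), so alternating them converges to a local maximum of the bound; because the factorised family cannot represent every posterior, the converged bound generally stays strictly below ln p(Y).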