7.6.3 Bayesian Ying-Yang
Bayesian Ying-Yang (BYY) defines a unified framework from which many statistics-based machine learning methods can be derived [243]. It describes both the probability distribution given by the data and the one described by the model, and aims at finding models whose distribution is closest to that of the data. Using the Kullback-Leibler divergence as the distribution comparison metric results in maximum likelihood learning, and will therefore cause the model to overfit. An alternative is Harmony Learning, which is based on minimising the cross entropy between the data distribution and the model distribution, and which prefers statistically simple distributions, that is, distributions of low entropy.
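The connection between the Kullback-Leibler divergence and maximum likelihood learning can be made explicit with a short standard derivation. The notation below (p for the data distribution, q_theta for the model with parameters theta) is chosen for illustration and is not taken from [243]:

\[
\mathrm{KL}(p \,\|\, q_\theta) = \int p(x) \ln \frac{p(x)}{q_\theta(x)} \,\mathrm{d}x = -H(p) - \mathbb{E}_p\!\left[\ln q_\theta(x)\right].
\]

As the entropy H(p) of the data distribution does not depend on theta, minimising the divergence is equivalent to maximising \mathbb{E}_p[\ln q_\theta(x)]; replacing this expectation by its empirical average over the training sample yields, up to a constant factor, the log-likelihood. Fitting a finite sample in this way is exactly what causes the overfitting noted above.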
Even though BYY is very likely applicable to LCS, as it has already been applied to the Mixtures-of-Experts model [242], there is no clear philosophical justification for the use of the cross entropy. Therefore, the Bayesian approach introduced in this chapter seems to be the better alternative.
7.6.4 Training Data-Based Approaches
It has been shown that penalising the model complexity based on structural properties of the model alone cannot compete on all scales with data-based methods like cross validation [125]. Furthermore, using the training data rather than an independent test set gives even better results in minimising the expected risk [13]. Two examples of such complexity measures are the Rademacher complexity and the Gaussian complexity [14]. Both are defined as the expected error of the model when trying to fit the data perturbed by a sequence of either Rademacher random variables (uniform over {-1, 1}) or Gaussian N(0, 1) random variables. Hence, they measure the model complexity by the model's ability to match a noisy sequence.
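To illustrate how such a measure can be evaluated in practice, the following Python sketch computes a Monte Carlo estimate of the empirical Rademacher (or Gaussian) complexity of a model class. All names and settings (fit_predict, ridge_fit_predict, the number of noise draws, the ridge penalty) are hypothetical choices for this sketch and are not taken from [14]:

    import numpy as np

    def empirical_complexity(fit_predict, X, n_draws=50, gaussian=False, seed=0):
        """Monte Carlo estimate of the empirical Rademacher (gaussian=False)
        or Gaussian (gaussian=True) complexity of the model class represented
        by fit_predict: how well freshly fitted models can match pure noise
        sequences used as training targets."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        scores = []
        for _ in range(n_draws):
            if gaussian:
                sigma = rng.standard_normal(n)           # N(0, 1) variables
            else:
                sigma = rng.choice([-1.0, 1.0], size=n)  # uniform over {-1, 1}
            preds = fit_predict(X, sigma)                # fit the model to the noise
            scores.append(np.mean(sigma * preds))        # agreement with the noise
        return float(np.mean(scores))

    # Hypothetical model class: ridge regression with a fixed penalty.
    def ridge_fit_predict(X, y, lam=1.0):
        w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
        return X @ w

    X = np.random.default_rng(1).standard_normal((100, 5))
    print(empirical_complexity(ridge_fit_predict, X))                 # Rademacher
    print(empirical_complexity(ridge_fit_predict, X, gaussian=True))  # Gaussian

The more flexible the model class, the better its fits can track the noise targets and the larger the estimate; this value can then serve as a data-based complexity penalty when comparing model structures.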
Using such methods in LCS would require training two models for the same model structure, one with the normal training data and the other with the perturbed data. It is questionable whether the additional space and computational effort justify applying these methods. Furthermore, sampling random variables to find the model complexity makes it impossible to find an analytical expression for the utility of the model, and thus provides little insight into how a particular model structure is selected. Nonetheless, it might still be of use as a benchmark method.
7.7 Discussion and Summary
This chapter tackled the core question of LCS: what is the best set of classifiers
that explains the given data? Rather than relying on intuition, this question was
approached formally by aiming to find the best model structure M that explains the given data D. More specifically, the principles of Bayesian model selection were applied to define the best set of classifiers as the most likely one given the data, that is, the one that maximises p(M|D).
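Written out, this criterion is a direct application of Bayes' rule (stated here in its standard form, with M and D as above):

\[
p(M \mid D) = \frac{p(D \mid M)\, p(M)}{p(D)} \propto p(D \mid M)\, p(M),
\]

so that, under a prior p(M) that does not favour any particular model structure, ranking sets of classifiers by their posterior probability reduces to ranking them by the model evidence p(D|M).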