7.6.3 Bayesian Ying-Yang
Bayesian Ying-Yang (BYY) defines a unified framework from which many statistics-based machine learning methods can be derived [243]. It describes both the probability distribution given by the data and the one described by the model, and aims at finding models whose distribution is closest to that of the data. Using the Kullback-Leibler divergence as the distribution comparison metric results in maximum likelihood learning, and will therefore cause the model to overfit. An alternative is Harmony Learning, which is based on minimising the cross entropy between the data distribution and the model distribution, and which prefers statistically simple distributions, that is, distributions of low entropy.
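The connection between the Kullback-Leibler divergence and maximum likelihood learning can be made explicit with a short standard derivation. The notation below (p for the data distribution, q_theta for the model with parameters theta) is chosen for illustration and is not taken from [243]:

\[
\mathrm{KL}(p \,\|\, q_\theta) = \int p(x) \ln \frac{p(x)}{q_\theta(x)} \,\mathrm{d}x = -H(p) - \mathbb{E}_p\!\left[\ln q_\theta(x)\right].
\]

As the entropy H(p) of the data distribution does not depend on theta, minimising the divergence is equivalent to maximising \mathbb{E}_p[\ln q_\theta(x)]; replacing this expectation by its empirical average over the training sample yields, up to a constant factor, the log-likelihood. Fitting a finite sample in this way is exactly what causes the overfitting noted above.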
Even though BYY is very likely applicable to LCS, as it has already been applied to the Mixtures-of-Experts model [242], there is no clear philosophical justification for the use of the cross entropy. Therefore, the Bayesian approach introduced in this chapter seems to be the better alternative.
7.6.4 Training Data-Based Approaches
It has been shown that penalising the model complexity based on structural properties of the model alone cannot compete on all scales with data-based methods like cross validation [125]. Furthermore, using the training data rather than an independent test set gives even better results in minimising the expected risk [13]. Two examples of such complexity measures are the Rademacher complexity and the Gaussian complexity [14]. Both are defined as the expected error of the model when trying to fit the data perturbed by a sequence of either Rademacher random variables (uniform over {-1, 1}) or Gaussian N(0, 1) random variables. Hence, they measure the model complexity by the model's ability to match a noisy sequence.
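To illustrate how such a measure can be evaluated in practice, the following Python sketch computes a Monte Carlo estimate of the empirical Rademacher (or Gaussian) complexity of a model class. All names and settings (fit_predict, ridge_fit_predict, the number of noise draws, the ridge penalty) are hypothetical choices for this sketch and are not taken from [14]:

    import numpy as np

    def empirical_complexity(fit_predict, X, n_draws=50, gaussian=False, seed=0):
        """Monte Carlo estimate of the empirical Rademacher (gaussian=False)
        or Gaussian (gaussian=True) complexity of the model class represented
        by fit_predict: how well freshly fitted models can match pure noise
        sequences used as training targets."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        scores = []
        for _ in range(n_draws):
            if gaussian:
                sigma = rng.standard_normal(n)           # N(0, 1) variables
            else:
                sigma = rng.choice([-1.0, 1.0], size=n)  # uniform over {-1, 1}
            preds = fit_predict(X, sigma)                # fit the model to the noise
            scores.append(np.mean(sigma * preds))        # agreement with the noise
        return float(np.mean(scores))

    # Hypothetical model class: ridge regression with a fixed penalty.
    def ridge_fit_predict(X, y, lam=1.0):
        w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
        return X @ w

    X = np.random.default_rng(1).standard_normal((100, 5))
    print(empirical_complexity(ridge_fit_predict, X))                 # Rademacher
    print(empirical_complexity(ridge_fit_predict, X, gaussian=True))  # Gaussian

The more flexible the model class, the better its fits can track the noise targets and the larger the estimate; this value can then serve as a data-based complexity penalty when comparing model structures.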
Using such methods in LCS would require training two models for the same model structure, one with the normal training data and the other with the perturbed data. It is questionable whether the additional space and computational effort justify applying these methods. Furthermore, sampling random variables to find the model complexity makes it impossible to find an analytical expression for the utility of the model, and thus provides little insight into how a particular model structure is selected. Nonetheless, it might still be of use as a benchmark method.
7.7 Discussion and Summary
This chapter tackled the core question of LCS: what is the best set of classifiers
that explains the given data? Rather than relying on intuition, this question was
approached formally by aiming to find the best model structure M that explains the given data D. More specifically, the principles of Bayesian model selection were applied to define the best set of classifiers as the most likely one given the data, that is, the one that maximises p(M|D).
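Written out, this criterion is a direct application of Bayes' rule (stated here in its standard form, with M and D as above):

\[
p(M \mid D) = \frac{p(D \mid M)\, p(M)}{p(D)} \propto p(D \mid M)\, p(M),
\]

so that, under a prior p(M) that does not favour any particular model structure, ranking sets of classifiers by their posterior probability reduces to ranking them by the model evidence p(D|M).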