3 Model Averaging and Regularization
We now discuss what properties are required of the base learners to form an effective ensemble, and how those properties can be attained with least squares classifiers (LSC).
3.1 Stability
The generalization ability of a learned function is closely related to its stability. Stability of the solution can be loosely defined as continuous dependence on the data: a stable solution changes very little for small changes in the data. A comprehensive treatment of this connection can be found in [2].
Furthermore, it is well known that bagging (bootstrap aggregation) can dramatically reduce the variance of unstable learners, providing a regularization effect [3]. Bagged ensembles do not overfit. The key to their performance is a low bias of the base learners and a low correlation between them.
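As a brief illustration, here is a minimal sketch of bagging with a generic base learner; the scikit-learn-style fit/predict interface and the function names are assumptions made for this example, not part of the text.

```python
import numpy as np

def bagged_predict(make_learner, X_train, y_train, X_test, n_learners=25, rng=None):
    """Train each learner on a bootstrap resample and average the predictions."""
    rng = rng or np.random.default_rng(0)
    n, preds = len(y_train), []
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)   # bootstrap sample, drawn with replacement
        learner = make_learner()
        learner.fit(X_train[idx], y_train[idx])
        preds.append(learner.predict(X_test))
    return np.mean(preds, axis=0)          # aggregation reduces the variance of unstable learners
```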
Evgeniou experimented with ensembles of SVMs [8]. Using a few datasets from the UCI repository, he tuned all parameters separately for a single SVM and for an ensemble of SVMs to achieve the best performance, and found that both perform similarly. However, he also found that the generalization bounds for ensembles are tighter than for a single machine.
Poggio et al. studied the relationship between stability and bagging [13]. They showed that there is a bagging scheme, in which each expert is trained on a disjoint subset of the training data, that provides strong stability to ensembles of non-strongly-stable experts, and therefore gives the same order of convergence for the generalization error as Tikhonov regularization. Thus, at least asymptotically, bagging strongly stable experts would not improve the generalization ability of the individual members.
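For concreteness, a minimal sketch of the disjoint-subset splitting used in such a scheme is given below; the helper name and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def disjoint_subsets(n_samples, n_experts, rng=None):
    """Partition the sample indices so that no two experts share a training point."""
    rng = rng or np.random.default_rng(0)
    perm = rng.permutation(n_samples)
    return np.array_split(perm, n_experts)  # one disjoint index set per expert
```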
3.2 Ensembles of RLSCs
An ensemble should thus consist of diverse experts with low bias. For RLSC, the bias is controlled by the regularization parameter and, in the case of a Gaussian kernel, by the kernel width σ. Instead of bootstrap sampling from the training data, which imposes a fixed sampling strategy, we found that much smaller sample sizes, on the order of 30-50% of the data set size, often improve performance. A further source of diversity is introduced by giving each expert a different, randomly drawn kernel width.
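The following sketch shows one way such an ensemble could be built: each expert is a regularized least squares classifier in its standard kernel form, obtained by solving (K + λI)c = y, trained on a random 30-50% subsample with its own randomly drawn Gaussian kernel width. The concrete ranges for λ and σ, and all function names, are illustrative assumptions rather than values from the text.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_rlsc(X, y, sigma, lam):
    """Fit one RLSC expert by solving the linear system (K + lambda*I) c = y."""
    K = gaussian_kernel(X, X, sigma)
    c = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return (X, c, sigma)

def train_ensemble(X, y, n_experts=25, lam=1e-2, rng=None):
    """Train experts on random 30-50% subsamples, each with its own kernel width."""
    rng = rng or np.random.default_rng(0)
    n, experts = len(y), []
    for _ in range(n_experts):
        m = int(rng.uniform(0.3, 0.5) * n)        # subsample 30-50% of the data
        idx = rng.choice(n, size=m, replace=False)
        sigma = rng.uniform(0.5, 2.0)             # random kernel width (assumed range)
        experts.append(train_rlsc(X[idx], y[idx], sigma, lam))
    return experts

def expert_output(expert, X):
    """Real-valued output of one expert before the sign decision function."""
    X_tr, c, sigma = expert
    return gaussian_kernel(X, X_tr, sigma) @ c
```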
Combining the outputs of the experts in an ensemble can be done in several ways. The simplest alternative is majority voting over the outputs of the experts. In binary classification this is equivalent to averaging the discretized (+1, −1) predictions of the experts. In our experiments this performed better than averaging the actual numeric expert outputs before applying their decision function (sign).
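The two combination rules compared above can be written compactly as follows; `expert_output` is the helper from the previous sketch, and the function names are illustrative.

```python
import numpy as np

def predict_majority_vote(experts, X):
    """Discretize each expert's output to +1/-1, then take the sign of the average vote."""
    votes = np.stack([np.sign(expert_output(e, X)) for e in experts])
    return np.sign(votes.mean(axis=0))

def predict_averaged_outputs(experts, X):
    """Average the raw numeric outputs first, then apply the sign decision function."""
    outputs = np.stack([expert_output(e, X) for e in experts])
    return np.sign(outputs.mean(axis=0))
```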