3 Model Averaging and Regularization
We now discuss what properties are required of the base learners to form an effective ensemble, and how those properties can be attained with least squares classifiers (LSC).
3.1 Stability
The generalization ability of a learned function is closely related to its stability. Stability of the solution can be loosely defined as continuous dependence on the data: a stable solution changes very little for small changes in the data. A comprehensive treatment of this connection can be found in [2].
Furthermore, it is well known that bagging (bootstrap aggregation) can dramatically reduce the variance of unstable learners, providing a regularization effect [3]. Bagged ensembles do not overfit. The key to their performance is a low bias of the base learners and a low correlation between them.
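As a brief illustration, here is a minimal sketch of bagging with a generic base learner; the scikit-learn-style fit/predict interface and the function names are assumptions made for this example, not part of the text.

```python
import numpy as np

def bagged_predict(make_learner, X_train, y_train, X_test, n_learners=25, rng=None):
    """Train each learner on a bootstrap resample and average the predictions."""
    rng = rng or np.random.default_rng(0)
    n, preds = len(y_train), []
    for _ in range(n_learners):
        idx = rng.integers(0, n, size=n)   # bootstrap sample, drawn with replacement
        learner = make_learner()
        learner.fit(X_train[idx], y_train[idx])
        preds.append(learner.predict(X_test))
    return np.mean(preds, axis=0)          # aggregation reduces the variance of unstable learners
```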
Evgeniou experimented with ensembles of SVMs [8]. Using a few datasets from the UCI repository, he tuned all parameters separately for a single SVM and for an ensemble of SVMs to achieve the best performance, and found that both perform similarly. However, he also found that the generalization bounds for ensembles are tighter than for a single machine.
Poggio et al. studied the relationship between stability and bagging [13]. They showed that there is a bagging scheme, in which each expert is trained on a disjoint subset of the training data, that provides strong stability to ensembles of non-strongly-stable experts, and therefore gives the same order of convergence for the generalization error as Tikhonov regularization. Thus, at least asymptotically, bagging strongly stable experts would not improve the generalization ability of the individual members.
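For concreteness, a minimal sketch of the disjoint-subset splitting used in such a scheme is given below; the helper name and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def disjoint_subsets(n_samples, n_experts, rng=None):
    """Partition the sample indices so that no two experts share a training point."""
    rng = rng or np.random.default_rng(0)
    perm = rng.permutation(n_samples)
    return np.array_split(perm, n_experts)  # one disjoint index set per expert
```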
3.2 Ensembles of RLSCs
An ensemble should thus consist of diverse experts with low bias. For RLSC, the bias is controlled by the regularization parameter and, in the case of a Gaussian kernel, by the kernel width σ. Instead of bootstrap sampling from the training data, which imposes a fixed sampling strategy, we found that much smaller sample sizes, on the order of 30-50% of the data set size, often improve performance. A further source of diversity is introduced by giving each expert a different, randomly drawn kernel width.
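The following sketch shows one way such an ensemble could be built: each expert is a regularized least squares classifier in its standard kernel form, obtained by solving (K + λI)c = y, trained on a random 30-50% subsample with its own randomly drawn Gaussian kernel width. The concrete ranges for λ and σ, and all function names, are illustrative assumptions rather than values from the text.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def train_rlsc(X, y, sigma, lam):
    """Fit one RLSC expert by solving the linear system (K + lambda*I) c = y."""
    K = gaussian_kernel(X, X, sigma)
    c = np.linalg.solve(K + lam * np.eye(len(y)), y)
    return (X, c, sigma)

def train_ensemble(X, y, n_experts=25, lam=1e-2, rng=None):
    """Train experts on random 30-50% subsamples, each with its own kernel width."""
    rng = rng or np.random.default_rng(0)
    n, experts = len(y), []
    for _ in range(n_experts):
        m = int(rng.uniform(0.3, 0.5) * n)        # subsample 30-50% of the data
        idx = rng.choice(n, size=m, replace=False)
        sigma = rng.uniform(0.5, 2.0)             # random kernel width (assumed range)
        experts.append(train_rlsc(X[idx], y[idx], sigma, lam))
    return experts

def expert_output(expert, X):
    """Real-valued output of one expert before the sign decision function."""
    X_tr, c, sigma = expert
    return gaussian_kernel(X, X_tr, sigma) @ c
```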
Combining the outputs of the experts in an ensemble can be done in several ways. The simplest alternative is majority voting over the outputs of the experts. In binary classification this is equivalent to averaging the discretized (+1, −1) predictions of the experts. In our experiments this performed better than averaging the actual numeric expert outputs before applying their decision function (sign).
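The two combination rules compared above can be written compactly as follows; `expert_output` is the helper from the previous sketch, and the function names are illustrative.

```python
import numpy as np

def predict_majority_vote(experts, X):
    """Discretize each expert's output to +1/-1, then take the sign of the average vote."""
    votes = np.stack([np.sign(expert_output(e, X)) for e in experts])
    return np.sign(votes.mean(axis=0))

def predict_averaged_outputs(experts, X):
    """Average the raw numeric outputs first, then apply the sign decision function."""
    outputs = np.stack([expert_output(e, X) for e in experts])
    return np.sign(outputs.mean(axis=0))
```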