A well-known avenue to improving the accuracy of an ensemble is to replace the simple averaging of individual experts with a weighting scheme. Instead of giving equal weight to each expert, the outputs of more reliable experts are weighted up (this holds even for classification problems). Linear regression can be applied to learn these weights.
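As a minimal sketch of this idea (the array names, shapes, and stand-in data below are this example's assumptions, not the chapter's):

```python
import numpy as np

# P[i, j] holds expert j's output for held-out sample i,
# and y holds the +/-1 class targets (placeholder data).
rng = np.random.default_rng(0)
P = rng.uniform(-1.0, 1.0, size=(200, 10))   # 200 samples, 10 experts
y = np.sign(P.mean(axis=1))                  # stand-in targets

# Learn one weight per expert by ordinary least squares; the ensemble
# then outputs sign(P @ w) instead of the simple average.
w, *_ = np.linalg.lstsq(P, y, rcond=None)
predictions = np.sign(P @ w)
```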
To avoid overfitting, the training material for this regression should be produced by passing through each expert only samples that did not participate in the construction of that expert. Typically this is done using a separate validation data set. Since some of the data sets used were very small, it was not useful to split the training sets further for this purpose. Instead, since each expert is constructed from only a fraction of the training data set, the rest of the data is available as "out-of-bag" (OOB) samples.
We experimented with two schemes to construct the training data matrix from which to learn the weights. The matrix consists of the outputs of each individual member of the ensemble: each row corresponds to a data sample in the training set, and each column corresponds to one expert of the ensemble. Since each expert populates its column only with OOB samples, the empty entries corresponding to the expert's own training data can be filled in either with zeroes, or with the outputs obtained by passing the expert's training data through it. The latter is optimistically biased, and the former is biased toward zero (the "don't know" condition). In the former case we also up-weighted the entries by the reciprocal of the fraction of missing entries, to compensate so that the inner product of the regression coefficients with the entries can still sum to plus or minus one.
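A sketch of this matrix construction under the two fill schemes follows; the function and variable names are hypothetical, and the compensation factor reflects one plausible reading of the description above:

```python
import numpy as np

def expert_output_matrix(experts, bag_indices, X, fill="zero"):
    """Build the n_samples x n_experts training matrix for the weights.

    `experts` are fitted models with a .predict method; bag_indices[j]
    lists the training rows expert j was built on. Both names, and the
    exact compensation factor, are this sketch's assumptions.
    """
    n, m = len(X), len(experts)
    P = np.zeros((n, m))
    observed = np.zeros((n, m), dtype=bool)
    for j, (expert, bag) in enumerate(zip(experts, bag_indices)):
        oob = np.setdiff1d(np.arange(n), bag)
        P[oob, j] = expert.predict(X[oob])   # unbiased OOB entries
        observed[oob, j] = True
        if fill == "self":
            # Optimistically biased variant: fill the in-bag rows by
            # passing the expert's own training data through it.
            P[bag, j] = expert.predict(X[bag])
    if fill == "zero":
        # Zero-filled ("don't know") variant: up-weight each row so the
        # inner product with the weights can still reach +/-1.
        frac_observed = observed.mean(axis=1).clip(min=1.0 / m)
        P /= frac_observed[:, None]
    return P
```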
Since expert outputs are correlated (although the aim is to have uncorrelated experts), PCA regression can be applied to reduce the number of regression coefficients. Partial Least Squares regression could also be used instead of PCA regression. We ended up using PCA regression in the final experiments.
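One way to realize PCA regression, e.g. with scikit-learn (the component count k and the placeholder data are this sketch's assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Placeholder data standing in for the expert-output matrix and targets.
rng = np.random.default_rng(0)
P = rng.normal(size=(200, 10))
y = np.sign(P[:, :3].sum(axis=1))

# PCA regression: project the correlated expert outputs onto their
# first k principal components, then regress on those components.
k = 5
pcr = make_pipeline(PCA(n_components=k), LinearRegression())
pcr.fit(P, y)
scores = pcr.predict(P)   # weighted ensemble output
```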
4 Variable Filtering with Tree-Based Ensembles
Because the data sets contained unknown irrelevant variables (50-90% of the variables were noise), we observed a significant improvement in accuracy when only a small (but important) fraction of the original variables was used in the kernel construction.
We used fast exploratory tree-based models for variable filtering. One of the many important properties of CART [5] is its embedded ability to select important variables during tree construction (a greedy recursive partitioning in which impurity reduction is maximized at every step), and hence its resistance to noise. Variable importance can then be defined as
$$ M(x_m, T) = \sum_{t \in T} \Delta I(x_m, t) \qquad (6) $$
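where the sum runs over the nodes t of tree T and $\Delta I(x_m, t)$ is the impurity reduction achieved by splitting on variable $x_m$ at node t. Off-the-shelf tree ensembles expose this summed impurity reduction directly; the following sketch of the filtering step is illustrative only (the estimator, the top-10 cutoff, and the synthetic data are this example's choices, not the chapter's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 100 variables of which only the first two matter,
# mimicking the 50-90% noise situation described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# scikit-learn's feature_importances_ is a (sample-weighted, normalized)
# version of the per-variable impurity-reduction sum of Eq. (6),
# averaged over the trees of the ensemble.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

keep = np.argsort(forest.feature_importances_)[::-1][:10]  # top 10 vars
X_filtered = X[:, keep]   # small, important subset for kernel construction
```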