A well-known avenue to improving the accuracy of an ensemble is to replace the simple averaging of individual experts with a weighting scheme. Instead of giving equal weight to each expert, the outputs of more reliable experts are weighted up (this holds even for classification problems). Linear regression can be applied to learn these weights.
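As a minimal sketch of this idea (the array names, shapes, and stand-in data below are this example's assumptions, not the chapter's):

```python
import numpy as np

# P[i, j] holds expert j's output for held-out sample i,
# and y holds the +/-1 class targets (placeholder data).
rng = np.random.default_rng(0)
P = rng.uniform(-1.0, 1.0, size=(200, 10))   # 200 samples, 10 experts
y = np.sign(P.mean(axis=1))                  # stand-in targets

# Learn one weight per expert by ordinary least squares; the ensemble
# then outputs sign(P @ w) instead of the simple average.
w, *_ = np.linalg.lstsq(P, y, rcond=None)
predictions = np.sign(P @ w)
```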
To avoid overfitting, the training material for this regression should be produced by passing through each expert only samples that did not participate in the construction of that expert. Typically this is done using a separate validation data set. Since some of the data sets used were very small, it was not useful to split the training sets further for this purpose. Instead, since each expert is constructed from only a fraction of the training data set, the rest of the data is available as "out-of-bag" (OOB) samples.
We experimented with two schemes to construct the training data matrix from which to learn the weights. The matrix consists of the outputs of each individual member of the ensemble: each row corresponds to a data sample in the training set, and each column corresponds to one expert of the ensemble. Since each expert populates its column only with OOB samples, the empty entries corresponding to the expert's own training data can be filled in either with zeroes, or with the outputs obtained by passing the expert's training data through it. The latter is optimistically biased, and the former is biased toward zero (the "don't know" condition). In the former case we also up-weighted the entries by the reciprocal of the fraction of missing entries, to compensate so that the inner product of the regression coefficients with the entries can still sum to plus or minus one.
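A sketch of this matrix construction under the two fill schemes follows; the function and variable names are hypothetical, and the compensation factor reflects one plausible reading of the description above:

```python
import numpy as np

def expert_output_matrix(experts, bag_indices, X, fill="zero"):
    """Build the n_samples x n_experts training matrix for the weights.

    `experts` are fitted models with a .predict method; bag_indices[j]
    lists the training rows expert j was built on. Both names, and the
    exact compensation factor, are this sketch's assumptions.
    """
    n, m = len(X), len(experts)
    P = np.zeros((n, m))
    observed = np.zeros((n, m), dtype=bool)
    for j, (expert, bag) in enumerate(zip(experts, bag_indices)):
        oob = np.setdiff1d(np.arange(n), bag)
        P[oob, j] = expert.predict(X[oob])   # unbiased OOB entries
        observed[oob, j] = True
        if fill == "self":
            # Optimistically biased variant: fill the in-bag rows by
            # passing the expert's own training data through it.
            P[bag, j] = expert.predict(X[bag])
    if fill == "zero":
        # Zero-filled ("don't know") variant: up-weight each row so the
        # inner product with the weights can still reach +/-1.
        frac_observed = observed.mean(axis=1).clip(min=1.0 / m)
        P /= frac_observed[:, None]
    return P
```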
Since expert outputs are correlated (although the aim is to have uncorrelated experts), PCA regression can be applied to reduce the number of regression coefficients. Partial Least Squares regression could also be used instead of PCA regression. We ended up using PCA regression in the final experiments.
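One way to realize PCA regression, e.g. with scikit-learn (the component count k and the placeholder data are this sketch's assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Placeholder data standing in for the expert-output matrix and targets.
rng = np.random.default_rng(0)
P = rng.normal(size=(200, 10))
y = np.sign(P[:, :3].sum(axis=1))

# PCA regression: project the correlated expert outputs onto their
# first k principal components, then regress on those components.
k = 5
pcr = make_pipeline(PCA(n_components=k), LinearRegression())
pcr.fit(P, y)
scores = pcr.predict(P)   # weighted ensemble output
```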
4 Variable Filtering with Tree-Based Ensembles
Because the data sets contained unknown irrelevant variables (50-90% of the variables were noise), we observed a significant improvement in accuracy when only a small (but important) fraction of the original variables was used in the kernel construction.
We used fast exploratory tree-based models for variable filtering. One of the many important properties of CART [5] is its embedded ability to select important variables during tree construction (a greedy recursive partitioning in which impurity reduction is maximized at every step), and hence its resistance to noise. Variable importance can then be defined as
$$ M(x_m, T) = \sum_{t \in T} \Delta I(x_m, t) \qquad (6) $$
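where the sum runs over the nodes t of tree T and $\Delta I(x_m, t)$ is the impurity reduction achieved by splitting on variable $x_m$ at node t. Off-the-shelf tree ensembles expose this summed impurity reduction directly; the following sketch of the filtering step is illustrative only (the estimator, the top-10 cutoff, and the synthetic data are this example's choices, not the chapter's):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 100 variables of which only the first two matter,
# mimicking the 50-90% noise situation described above.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# scikit-learn's feature_importances_ is a (sample-weighted, normalized)
# version of the per-variable impurity-reduction sum of Eq. (6),
# averaged over the trees of the ensemble.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

keep = np.argsort(forest.feature_importances_)[::-1][:10]  # top 10 vars
X_filtered = X[:, keep]   # small, important subset for kernel construction
```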