tuples from the original training set and also on some fabricated tuples.
In each iteration, the input attribute values of the fabricated tuples are
generated according to the original data distribution, whereas their target
values are chosen so as to differ maximally from the current ensemble
predictions. Comprehensive experiments have demonstrated that this
technique is consistently more accurate than the base classifier, bagging,
and random forests. Decorate also obtains higher accuracy than boosting
on small training sets, and achieves comparable performance on larger
ones.
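The following is a minimal sketch of the Decorate loop described above,
assuming scikit-learn and purely numeric attributes. The helper names and
the per-attribute Gaussian model of the data distribution are illustrative
choices, not details from the text; note also that the published algorithm
additionally rejects a new member if it increases ensemble error, a check
this sketch omits.

    import numpy as np
    from sklearn.base import clone
    from sklearn.tree import DecisionTreeClassifier

    def decorate_sketch(X, y, base=DecisionTreeClassifier(),
                        n_members=10, n_artificial=None, seed=None):
        rng = np.random.default_rng(seed)
        if n_artificial is None:
            n_artificial = len(X)
        classes = np.unique(y)
        ensemble = [clone(base).fit(X, y)]

        def ensemble_proba(Xq):
            # Average the members' class-probability estimates.
            p = sum(m.predict_proba(Xq) for m in ensemble)
            return p / len(ensemble)

        while len(ensemble) < n_members:
            # 1. Fabricate input values according to the original data
            #    distribution (here a per-attribute Gaussian fit, one
            #    common simplification).
            X_art = rng.normal(X.mean(axis=0), X.std(axis=0) + 1e-9,
                               size=(n_artificial, X.shape[1]))
            # 2. Choose target values that differ maximally from the
            #    current ensemble predictions: sample each label with
            #    probability inversely proportional to the ensemble's
            #    class probability.
            inv = 1.0 / (ensemble_proba(X_art) + 1e-9)
            inv /= inv.sum(axis=1, keepdims=True)
            y_art = np.array([rng.choice(classes, p=row) for row in inv])
            # 3. Train the next member on real plus fabricated tuples.
            ensemble.append(clone(base).fit(np.vstack([X, X_art]),
                                            np.concatenate([y, y_art])))
        return ensemble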
9.5.2.3 Partitioning
Some argue that classic ensemble techniques (such as boosting and bagging)
have limitations on massive datasets, because the size of the dataset can
become a bottleneck [Chawla et al. (2004)]. Moreover, it has been suggested
that partitioning the dataset into random, disjoint partitions not only
overcomes the problem of exceeding memory size, but also leads to an
ensemble of diverse and accurate classifiers, each built from a disjoint
partition, with the ensemble as a whole processing all of the data. This can
improve performance in a way that might not be possible by subsampling.
In fact, empirical studies have shown that the performance of the multiple
disjoint partition approach is equivalent to that obtained by popular
ensemble techniques such as bagging. More recently, a framework for
building thousands of classifiers that are trained on small subsets of the
data in a distributed environment was proposed [Chawla et al. (2004)].
It has been shown empirically that this framework is fast, accurate, and
scalable.
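As a concrete illustration of the idea just described, here is a minimal
sketch of a disjoint-partition ensemble, assuming scikit-learn; the
function names `disjoint_partition_ensemble` and `vote` are hypothetical
and not taken from Chawla et al. (2004).

    import numpy as np
    from sklearn.base import clone
    from sklearn.tree import DecisionTreeClassifier

    def disjoint_partition_ensemble(X, y, n_partitions=8,
                                    base=DecisionTreeClassifier(),
                                    seed=None):
        rng = np.random.default_rng(seed)
        # Random, disjoint partitions that together cover all of the data.
        parts = np.array_split(rng.permutation(len(X)), n_partitions)
        return [clone(base).fit(X[p], y[p]) for p in parts]

    def vote(ensemble, Xq):
        # Plurality vote over the members' predictions.
        preds = np.stack([m.predict(Xq) for m in ensemble])
        out = []
        for col in preds.T:
            vals, counts = np.unique(col, return_counts=True)
            out.append(vals[np.argmax(counts)])
        return np.array(out)

Because the partitions are disjoint, each member trains on only 1/n of
the data, yet the aggregate vote reflects all of it.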
Clustering techniques can also be used to partition the sample. The
cluster-based concurrent decomposition (CBCD) algorithm first clusters the
instance space using the K-means clustering algorithm. It then creates
disjoint sub-samples from the clusters in such a way that each sub-sample
comprises tuples from all clusters and hence represents the entire dataset.
An inducer is applied in turn to each sub-sample, and a voting mechanism
is used to combine the classifiers' classifications. An experimental study
indicates that the CBCD algorithm outperforms the bagging algorithm.
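Below is a minimal sketch of one plausible reading of the CBCD procedure,
assuming scikit-learn's KMeans; the function name and the way each cluster
is sliced across sub-samples are assumptions rather than details taken
from the published algorithm.

    import numpy as np
    from sklearn.base import clone
    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeClassifier

    def cbcd_sketch(X, y, n_clusters=5, n_subsamples=4,
                    base=DecisionTreeClassifier(), seed=None):
        rng = np.random.default_rng(seed)
        # Step 1: cluster the instance space with K-means.
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
        # Step 2: slice every cluster into n_subsamples pieces and build
        # each sub-sample from one piece per cluster, so every sub-sample
        # contains tuples from all clusters and mirrors the full dataset.
        subsamples = [[] for _ in range(n_subsamples)]
        for c in range(n_clusters):
            members = rng.permutation(np.flatnonzero(labels == c))
            for j, piece in enumerate(np.array_split(members,
                                                     n_subsamples)):
                subsamples[j].extend(piece)
        # Step 3: apply the inducer to each sub-sample in turn.
        models = []
        for s in subsamples:
            s = np.asarray(s, dtype=int)
            models.append(clone(base).fit(X[s], y[s]))
        return models

The `vote` helper from the previous sketch can then be used to combine the
members' classifications.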
9.5.3 Manipulating the Target Attribute Representation
In methods that manipulate the target attribute, instead of inducing a
single complicated classifier, several classifiers with different and usually
simpler representations of the target attribute are induced.