tuples from the original training set and also on some fabricated tuples.
In each iteration, the input attribute values of the fabricated tuples are
generated according to the original data distribution, whereas their target
values are chosen so as to differ maximally from the current ensemble
predictions. Comprehensive experiments have demonstrated that this
technique is consistently more accurate than the base classifier, bagging,
and random forests. Decorate also obtains higher accuracy than boosting
on small training sets, and achieves comparable performance on larger
ones.
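The following is a minimal sketch of the Decorate loop described above,
assuming scikit-learn and purely numeric attributes. The helper names and
the per-attribute Gaussian model of the data distribution are illustrative
choices, not details from the text; note also that the published algorithm
additionally rejects a new member if it increases ensemble error, a check
this sketch omits.

    import numpy as np
    from sklearn.base import clone
    from sklearn.tree import DecisionTreeClassifier

    def decorate_sketch(X, y, base=DecisionTreeClassifier(),
                        n_members=10, n_artificial=None, seed=None):
        rng = np.random.default_rng(seed)
        if n_artificial is None:
            n_artificial = len(X)
        classes = np.unique(y)
        ensemble = [clone(base).fit(X, y)]

        def ensemble_proba(Xq):
            # Average the members' class-probability estimates.
            p = sum(m.predict_proba(Xq) for m in ensemble)
            return p / len(ensemble)

        while len(ensemble) < n_members:
            # 1. Fabricate input values according to the original data
            #    distribution (here a per-attribute Gaussian fit, one
            #    common simplification).
            X_art = rng.normal(X.mean(axis=0), X.std(axis=0) + 1e-9,
                               size=(n_artificial, X.shape[1]))
            # 2. Choose target values that differ maximally from the
            #    current ensemble predictions: sample each label with
            #    probability inversely proportional to the ensemble's
            #    class probability.
            inv = 1.0 / (ensemble_proba(X_art) + 1e-9)
            inv /= inv.sum(axis=1, keepdims=True)
            y_art = np.array([rng.choice(classes, p=row) for row in inv])
            # 3. Train the next member on real plus fabricated tuples.
            ensemble.append(clone(base).fit(np.vstack([X, X_art]),
                                            np.concatenate([y, y_art])))
        return ensemble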
9.5.2.3 Partitioning
Some argue that classic ensemble techniques (such as boosting and bagging)
have limitations on massive datasets, because the size of the dataset can
become a bottleneck [Chawla et al. (2004)]. Moreover, it has been suggested
that partitioning the dataset into random, disjoint partitions not only
overcomes the problem of exceeding memory size, but also leads to an
ensemble of diverse and accurate classifiers, each built from a disjoint
partition, with the ensemble as a whole processing all of the data. This can
improve performance in a way that might not be possible by subsampling.
In fact, empirical studies have shown that the performance of the multiple
disjoint partition approach is equivalent to that obtained by popular
ensemble techniques such as bagging. More recently, a framework for
building thousands of classifiers that are trained on small subsets of the
data in a distributed environment was proposed [Chawla et al. (2004)].
It has been shown empirically that this framework is fast, accurate, and
scalable.
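As a concrete illustration of the idea just described, here is a minimal
sketch of a disjoint-partition ensemble, assuming scikit-learn; the
function names `disjoint_partition_ensemble` and `vote` are hypothetical
and not taken from Chawla et al. (2004).

    import numpy as np
    from sklearn.base import clone
    from sklearn.tree import DecisionTreeClassifier

    def disjoint_partition_ensemble(X, y, n_partitions=8,
                                    base=DecisionTreeClassifier(),
                                    seed=None):
        rng = np.random.default_rng(seed)
        # Random, disjoint partitions that together cover all of the data.
        parts = np.array_split(rng.permutation(len(X)), n_partitions)
        return [clone(base).fit(X[p], y[p]) for p in parts]

    def vote(ensemble, Xq):
        # Plurality vote over the members' predictions.
        preds = np.stack([m.predict(Xq) for m in ensemble])
        out = []
        for col in preds.T:
            vals, counts = np.unique(col, return_counts=True)
            out.append(vals[np.argmax(counts)])
        return np.array(out)

Because the partitions are disjoint, each member trains on only 1/n of
the data, yet the aggregate vote reflects all of it.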
Clustering techniques can also be used to partition the sample. The
cluster-based concurrent decomposition (CBCD) algorithm first clusters the
instance space using the K-means clustering algorithm. It then creates
disjoint sub-samples from the clusters in such a way that each sub-sample
comprises tuples from all clusters and hence represents the entire dataset.
An inducer is applied in turn to each sub-sample, and a voting mechanism
is used to combine the classifiers' classifications. An experimental study
indicates that the CBCD algorithm outperforms the bagging algorithm.
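Below is a minimal sketch of one plausible reading of the CBCD procedure,
assuming scikit-learn's KMeans; the function name and the way each cluster
is sliced across sub-samples are assumptions rather than details taken
from the published algorithm.

    import numpy as np
    from sklearn.base import clone
    from sklearn.cluster import KMeans
    from sklearn.tree import DecisionTreeClassifier

    def cbcd_sketch(X, y, n_clusters=5, n_subsamples=4,
                    base=DecisionTreeClassifier(), seed=None):
        rng = np.random.default_rng(seed)
        # Step 1: cluster the instance space with K-means.
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
        # Step 2: slice every cluster into n_subsamples pieces and build
        # each sub-sample from one piece per cluster, so every sub-sample
        # contains tuples from all clusters and mirrors the full dataset.
        subsamples = [[] for _ in range(n_subsamples)]
        for c in range(n_clusters):
            members = rng.permutation(np.flatnonzero(labels == c))
            for j, piece in enumerate(np.array_split(members,
                                                     n_subsamples)):
                subsamples[j].extend(piece)
        # Step 3: apply the inducer to each sub-sample in turn.
        models = []
        for s in subsamples:
            s = np.asarray(s, dtype=int)
            models.append(clone(base).fit(X[s], y[s]))
        return models

The `vote` helper from the previous sketch can then be used to combine the
members' classifications.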
9.5.3 Manipulating the Target Attribute Representation
In methods that manipulate the target attribute, instead of inducing a
single complicated classifier, several classifiers with different and usually
simpler representations of the target attribute are induced.