Level 3 — A majority vote will not always correctly classify a given
instance, but at least one ensemble member always correctly classifies it.
Level 4 — The function is not always covered by the members of the
ensemble.
Brown et al. (2005) claim that the above four-level scheme provides
no indication of how typical the error behavior described by the assigned
diversity level is. This is especially true when the ensemble exhibits
different diversity levels on different subsets of the instance space.
There are other, more quantitative measures, which Brown et al. (2005)
categorize into two types: pairwise and non-pairwise.
Pairwise measures calculate the average of a particular distance metric
between all possible pairings of members in the ensemble, such as the
Q-statistic [Brown et al. (2005)] or the kappa-statistic [Margineantu and
Dietterich (1997)]. Non-pairwise measures either use the idea of entropy
(such as [Cunningham and Carney (2000)]) or calculate the correlation of
each ensemble member with the averaged output. A comparison of several
measures of diversity has led to the conclusion that most of them are
correlated [Kuncheva and Whitaker (2003)].
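As a minimal sketch of the pairwise idea, the Q-statistic for two classifiers can be computed from the counts of instances on which both are correct, both are wrong, or exactly one is correct; averaging over all pairs gives an ensemble-level diversity score. The function names below are illustrative, not from the cited works.

```python
import itertools

def q_statistic(pred_i, pred_j, y_true):
    """Yule's Q-statistic for one pair of classifiers.

    Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10), where Nab counts
    instances on which classifier i is correct (a=1) or not (a=0)
    and classifier j is correct (b=1) or not (b=0).
    Q near 0 indicates independent errors; negative Q indicates
    classifiers that tend to err on different instances (more diverse).
    """
    n11 = n00 = n10 = n01 = 0
    for pi, pj, y in zip(pred_i, pred_j, y_true):
        ci, cj = (pi == y), (pj == y)
        if ci and cj:
            n11 += 1
        elif ci and not cj:
            n10 += 1
        elif cj and not ci:
            n01 += 1
        else:
            n00 += 1
    return (n11 * n00 - n01 * n10) / (n11 * n00 + n01 * n10)

def mean_pairwise_q(all_preds, y_true):
    """Average Q-statistic over all possible pairs of ensemble members."""
    pairs = list(itertools.combinations(range(len(all_preds)), 2))
    return sum(q_statistic(all_preds[i], all_preds[j], y_true)
               for i, j in pairs) / len(pairs)
```

For example, two classifiers that are wrong on disjoint sets of instances yield Q = -1 (maximally diverse), while a classifier paired with itself yields Q = 1.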
9.6 Ensemble Size
9.6.1 Selecting the Ensemble Size
An important aspect of ensemble methods is to define how many component
classifiers should be used. There are several factors that may determine this
size:
Desired accuracy — In most cases, ensembles containing 10 classifiers
are sufficient for reducing the error rate [Hansen and Salamon (1990)].
Nevertheless, there is empirical evidence indicating that when AdaBoost
uses decision trees, error reduction continues even in relatively large
ensembles containing 25 classifiers [Opitz and Maclin (1999)]. In disjoint
partitioning approaches, there may be a trade-off between the number
of subsets and the final accuracy. The size of each subset cannot be too
small because sufficient data must be available for each learning process
to produce an effective classifier.
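The Hansen and Salamon (1990) argument can be sketched with a binomial calculation: if T classifiers err independently, each with error rate p < 0.5, a majority vote errs only when more than half of them are wrong, a probability that drops quickly as T grows. This is a simplified model assuming independent errors, not a result taken from the text.

```python
from math import comb

def majority_vote_error(p, T):
    """Error probability of a majority vote over T independent
    classifiers, each with individual error rate p.

    Assumes odd T so no ties occur; the ensemble errs when at
    least T//2 + 1 members are wrong (a binomial tail probability).
    """
    k_min = T // 2 + 1
    return sum(comb(T, k) * p**k * (1 - p)**(T - k)
               for k in range(k_min, T + 1))
```

With p = 0.3, a single classifier errs 30% of the time, but an ensemble of 11 such classifiers errs well under 10% of the time; with p = 0.5 the vote gains nothing, which is why each member must be better than random.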
Computational cost — Increasing the number of classifiers usually
increases the computational cost and decreases the ensemble's
comprehensibility. For that reason, users may set their preferences by
predefining a limit on the ensemble size.