Efficient storage methods — enabling the algorithm to handle many
records. For instance, Shafer et al. (1996) presented SPRINT, which
constructs an attribute-list data structure.
Reducing the algorithm's search space — for instance, the PUBLIC
algorithm [Rastogi and Shim (2000)] integrates the growing and pruning
of decision trees by using the Minimum Description Length (MDL) approach
in order to reduce the computational complexity.
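A minimal sketch of the attribute-list idea behind SPRINT: each attribute gets its own list of (value, class label, record id) entries, sorted once by value so that candidate split points can be scanned at every node without re-sorting. The field names and data here are illustrative assumptions, not SPRINT's actual on-disk layout:

```python
# Sketch of a SPRINT-style attribute list (illustrative, not the
# actual SPRINT implementation): one pre-sorted list per attribute.

records = [
    {"rid": 0, "age": 42, "income": 30, "label": "no"},
    {"rid": 1, "age": 25, "income": 60, "label": "yes"},
    {"rid": 2, "age": 37, "income": 45, "label": "yes"},
]

def build_attribute_list(records, attr):
    # Each entry: (attribute value, class label, record id),
    # sorted once by value and reused when evaluating splits.
    entries = [(r[attr], r["label"], r["rid"]) for r in records]
    return sorted(entries)

age_list = build_attribute_list(records, "age")
# age_list == [(25, 'yes', 1), (37, 'yes', 2), (42, 'no', 0)]
```

Because each list stays sorted, scanning it in order enumerates all candidate numeric split thresholds for that attribute in a single pass.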
4.6 Robustness
The ability of a model to handle noise or data with missing values
and still make correct predictions is called robustness. Different decision
tree algorithms have different robustness levels. To estimate the
robustness of a classification tree, it is common to train one tree on a clean
training set and another tree on a noisy training set. The
noisy training set is usually the clean training set to which some artificial
noisy instances have been added. The robustness level is measured as the
difference in accuracy between these two trees.
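The procedure above can be sketched as follows, here using scikit-learn and random label flipping as the artificial noise; the dataset, noise rate, and random seeds are illustrative assumptions:

```python
# Sketch: estimate robustness as the accuracy drop when a tree is
# trained on a noisy copy of the training set (dataset and 15%
# label-flip noise rate are illustrative assumptions).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Tree trained on the clean training set.
clean_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
clean_acc = accuracy_score(y_test, clean_tree.predict(X_test))

# Noisy copy of the training set: flip 15% of the labels at random.
rng = np.random.default_rng(0)
y_noisy = y_train.copy()
flip = rng.choice(len(y_noisy), size=int(0.15 * len(y_noisy)), replace=False)
y_noisy[flip] = 1 - y_noisy[flip]

noisy_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_noisy)
noisy_acc = accuracy_score(y_test, noisy_tree.predict(X_test))

# Smaller gap on the same clean test set = more robust algorithm.
robustness_gap = clean_acc - noisy_acc
```

Both trees are evaluated on the same clean test set, so the gap isolates the effect of the injected noise rather than differences in the evaluation data.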
4.7 Stability
Formally, stability of a classification algorithm is defined as the degree to
which an algorithm generates repeatable results, given different batches of
data from the same process. In mathematical terms, stability is the expected
agreement between two models on a random sample of the original data,
where agreement on a specific example means that both models assign it to
the same class. The instability problem raises questions about the validity
of a particular tree provided as the output of a decision-tree algorithm.
Users view the learning algorithm as an oracle, and it is difficult to
trust an oracle that says something radically different each time you make
a slight change in the data.
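The agreement-based definition of stability can be sketched directly: train two trees on different batches drawn from the same process and measure the fraction of held-out examples that both assign to the same class. The dataset and split sizes are illustrative assumptions:

```python
# Sketch: stability as the expected agreement of two trees trained
# on different batches of data from the same process (dataset and
# splits are illustrative assumptions).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, n_features=10, random_state=1)

# Hold out an evaluation set, then split the rest into two batches.
X_rest, X_eval, y_rest, y_eval = train_test_split(
    X, y, test_size=0.3, random_state=1
)
X_a, X_b, y_a, y_b = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=1
)

tree_a = DecisionTreeClassifier(random_state=0).fit(X_a, y_a)
tree_b = DecisionTreeClassifier(random_state=0).fit(X_b, y_b)

# Agreement: fraction of evaluation examples to which both models
# assign the same class (1.0 = perfectly stable on this sample).
agreement = (tree_a.predict(X_eval) == tree_b.predict(X_eval)).mean()
```

Averaging this agreement over many random pairs of batches estimates the expected agreement described above.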
Existing methods of constructing decision trees from data suffer from a
major problem of instability. The symptoms of instability include variations
in the predictive accuracy of the model and in the model's topology.
Instability can be revealed not only by using disjoint sets of data, but even
by replacing a small portion of the training cases, as in the cross-validation
procedure. If an algorithm is unstable, the cross-validation results become
high-variance estimators, which means that the algorithm fails to make
a clear distinction between persistent and random patterns in the data. As