In the holdout method, the data is partitioned into two disjoint subsets:
the training set and the test set. Usually, two-thirds of the data is allocated
to the training set and the remaining third to the test set. First,
the training set is used by the inducer to construct a suitable classifier and
then we measure the misclassification rate of this classifier on the test set.
This test set error usually provides a better estimation of the generalization
error than the training error. The reason is that the training error usually
underestimates the generalization error (due to the overfitting
phenomenon). Nevertheless, since only a proportion of the data is used to
derive the model, the estimate of accuracy tends to be pessimistic.
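As a minimal sketch of the holdout procedure (the data set, the scikit-learn helpers and the decision-tree inducer are illustrative choices, not part of the original text), the two-thirds/one-third split mentioned above might look like this:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical example: any data set and inducer could be substituted.
X, y = load_iris(return_X_y=True)

# Hold out one-third of the data for testing (two-thirds for training).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1 / 3, random_state=0)

inducer = DecisionTreeClassifier(random_state=0)
inducer.fit(X_train, y_train)

# The misclassification rate on the held-out test set is the holdout
# estimate of the generalization error.
test_error = 1.0 - inducer.score(X_test, y_test)
print(f"Holdout test error: {test_error:.3f}")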
A variation of the holdout method can be used when data is limited.
It is common practice to resample the data, that is, partition the data into
training and test sets in different ways. An inducer is trained and tested for
each partition and the accuracies averaged. By doing this, a more reliable
estimate of the true generalization error of the inducer is provided.
Random subsampling and n-fold cross-validation are two common
resampling methods. In random subsampling, the data is randomly
partitioned several times into disjoint training and test sets, and the errors
obtained from each partition are averaged. In n-fold cross-validation, the
data is randomly split into n mutually exclusive subsets (folds) of
approximately equal size. An inducer is trained and tested n times; each
time it is tested on one of the n folds and trained on the remaining n - 1
folds.
The cross-validation estimate of the generalization error is the overall
number of misclassifications divided by the number of examples in the data.
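A sketch of this estimate, assuming scikit-learn's KFold and the same illustrative data set and inducer as above: the misclassifications over the n held-out folds are summed and divided by the total number of examples.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_folds = 10

misclassified = 0
kf = KFold(n_splits=n_folds, shuffle=True, random_state=0)
for train_idx, test_idx in kf.split(X):
    # Train on the remaining n - 1 folds, test on the held-out fold.
    inducer = DecisionTreeClassifier(random_state=0)
    inducer.fit(X[train_idx], y[train_idx])
    misclassified += np.sum(inducer.predict(X[test_idx]) != y[test_idx])

# Cross-validation estimate: overall misclassifications / number of examples.
cv_error = misclassified / len(y)
print(f"{n_folds}-fold CV error estimate: {cv_error:.3f}")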
The random subsampling method has the advantage that it can be repeated
an indefinite number of times. However, a disadvantage is that the test sets
are not independently drawn with respect to the underlying distribution of
examples. Because of this, using a t-test for paired differences with random
subsampling can lead to an increased chance of type I error, i.e. identifying
a significant difference when one does not actually exist. Using a t-test on
the generalization error produced on each fold lowers the chances of type
I error but may not give a stable estimate of the generalization error. It
is common practice to repeat n-fold cross-validation n times in order to
provide a stable estimate. However, this, of course, renders the test sets
non-independent and increases the chance of type I error. Unfortunately,
there is no satisfactory solution to this problem. Alternative tests suggested
by Dietterich (1998) have a low probability of type I error but a higher
chance of type II error, that is, failing to identify a significant difference
when one does actually exist.
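As an illustration of the paired t-test on per-fold errors described above (not of Dietterich's alternative tests), a sketch comparing two hypothetical inducers with scipy.stats.ttest_rel might read as follows; all names are illustrative assumptions.

import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=10, shuffle=True, random_state=0)

errors_tree, errors_nb = [], []
for train_idx, test_idx in kf.split(X):
    for model, errors in ((DecisionTreeClassifier(random_state=0), errors_tree),
                          (GaussianNB(), errors_nb)):
        model.fit(X[train_idx], y[train_idx])
        errors.append(np.mean(model.predict(X[test_idx]) != y[test_idx]))

# Paired t-test on the per-fold error estimates of the two inducers.
t_stat, p_value = ttest_rel(errors_tree, errors_nb)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")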
Stratification is a process often applied during random subsampling
and n-fold cross-validation. Stratification ensures that the class distribution
of the full data set is approximately preserved in each training and test set.
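A brief sketch of stratified partitioning, assuming scikit-learn's StratifiedKFold (an illustrative choice): each fold keeps roughly the same class proportions as the whole data set.

from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)

# Each test fold keeps approximately the same class proportions as the full data.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y)):
    print(f"fold {fold}: test-set class counts = {Counter(y[test_idx])}")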