Evaluation of Classification Trees - Data Mining with Decision Trees: Theory and Applications


(2) The test does not measure variation due to the choice of the training

set or the internal variation of the learning algorithm. Also it measures

the performance of the algorithms on training sets of a size significantly

smaller than the whole dataset.

4.2.7.3 The Resampled Paired t Test

The resampled paired t test is the most popular in machine learning. Usually, the test consists of a series of 30 trials. In each trial, the available sample S is randomly divided into a training set R (typically two thirds of the data) and a test set T. The algorithms A and B are both trained on R and the resulting classifiers are tested on T. Let $p_A^{(i)}$ and $p_B^{(i)}$ be the observed proportions of test examples misclassified by algorithms A and B, respectively, during the i-th trial. If we assume that the 30 differences $p^{(i)} = p_A^{(i)} - p_B^{(i)}$ were drawn independently from a normal distribution, then we can apply Student's t test by computing the statistic:

$$t = \frac{\bar{p} \cdot \sqrt{n}}{\sqrt{\dfrac{\sum_{i=1}^{n} \left(p^{(i)} - \bar{p}\right)^2}{n-1}}}, \tag{4.26}$$

where $\bar{p} = \frac{1}{n} \sum_{i=1}^{n} p^{(i)}$. Under the null hypothesis, this statistic has a t distribution with $n - 1$ degrees of freedom. Then for 30 trials, the null hypothesis could be rejected if $|t| > t_{29,\,0.975} = 2.045$. The main drawbacks of this approach are:

(1) Since $p_A^{(i)}$ and $p_B^{(i)}$ are not independent, the difference $p^{(i)}$ will not have a normal distribution.

(2) The $p^{(i)}$'s are not independent, because the test and training sets in the trials overlap.
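
The procedure above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the function names (`paired_t_statistic`, `error_rate`, `resampled_differences`) and the learner interface (`learn_a`, `learn_b`, each taking a list of `(x, y)` pairs and returning a function from `x` to a label) are hypothetical and do not come from the text. Only the statistic of Eq. (4.26), the two-thirds split, the 30 trials and the critical value 2.045 are taken from the section.

```python
import math
import random
from statistics import mean, stdev

# t_{29, 0.975}: the critical value quoted in the text for 30 trials.
T_CRIT_29 = 2.045


def paired_t_statistic(diffs):
    """Return t = p_bar * sqrt(n) / s for the per-trial differences p^(i).

    s is the sample standard deviation (n - 1 in the denominator), which
    matches the square root in the denominator of Eq. (4.26).
    Needs at least two differences that are not all identical,
    otherwise s is undefined or zero.
    """
    n = len(diffs)
    p_bar = mean(diffs)
    s = stdev(diffs)
    return p_bar * math.sqrt(n) / s


def error_rate(classifier, test_set):
    """Proportion of (x, y) test examples the classifier misclassifies."""
    wrong = sum(1 for x, y in test_set if classifier(x) != y)
    return wrong / len(test_set)


def resampled_differences(data, learn_a, learn_b,
                          n_trials=30, train_frac=2 / 3, seed=0):
    """Run the resampling trials and return the differences p_A - p_B.

    learn_a and learn_b are hypothetical callables (an assumption of this
    sketch): each takes a training list of (x, y) pairs and returns a
    classifier, i.e. a function mapping x to a predicted label.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_trials):
        shuffled = data[:]
        rng.shuffle(shuffled)                     # random division of S
        cut = round(len(shuffled) * train_frac)   # size of training set R
        train, test = shuffled[:cut], shuffled[cut:]
        p_a = error_rate(learn_a(train), test)    # both trained on R,
        p_b = error_rate(learn_b(train), test)    # both tested on T
        diffs.append(p_a - p_b)
    return diffs
```

With 30 trials, the null hypothesis would be rejected when `abs(paired_t_statistic(diffs)) > T_CRIT_29`. The sketch inherits both drawbacks listed above, since the random splits overlap from trial to trial.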

4.2.7.4 The k-fold Cross-validated Paired t Test

This approach is similar to the resampled paired t test except that instead of

constructing each pair of training and test sets by randomly dividing S, the dataset is randomly divided into k disjoint sets of equal size, $T_1, T_2, \ldots, T_k$. Then k trials are conducted. In each trial, the test set is $T_i$ and the training set is the union of all of the others $T_j$, $j \neq i$. The t statistic is computed

as described in Section 4.2.7.3. The advantage of this approach is that

each test set is independent of the others. However, there is the problem

that the training sets overlap. This overlap may prevent this statistical test

from obtaining a good estimation of the amount of variation that would
