to class “-” is very low, so a new split in the tree is not considered. After the lonely instance is duplicated, however, the density of examples belonging to its class grows, making a further split of the tree possible and allowing different decision borders to be built.
If the two sources of instability mentioned above were generated at random, no improvement in the final accuracy could be expected. We wanted to test whether instability generated according to the cases misclassified by another algorithm (k-NN) could lead to an improvement over the accuracy yielded by the original ID3. The next section presents the experimental results we obtained.
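To make the mechanism concrete, the fragment below is a minimal sketch of this duplication step, not the authors' implementation: it assumes scikit-learn's KNeighborsClassifier and DecisionTreeClassifier (the latter standing in for ID3, which scikit-learn does not provide), and the function name knn_boosting_fit is ours.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def knn_boosting_fit(X, y, k=3):
    """Duplicate the training cases that k-NN misclassifies,
    then grow a decision tree on the enlarged training set."""
    # NOTE: predicting on the training data keeps each point among its
    # own neighbors; whether the paper uses leave-one-out k-NN instead
    # is an assumption we do not resolve here.
    knn = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    misclassified = knn.predict(X) != y            # boolean mask
    X_aug = np.vstack([X, X[misclassified]])       # duplicated instances
    y_aug = np.concatenate([y, y[misclassified]])
    # entropy criterion approximates ID3's information-gain splits
    return DecisionTreeClassifier(criterion="entropy").fit(X_aug, y_aug)

Duplicating a misclassified instance doubles its weight in the impurity computation, which is what raises the local class density enough to trigger splits the original tree would not consider.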
5 Experimental Results
Ten databases are used to test our hypothesis. All of them were obtained from the UCI Machine Learning Repository [2]. These domains are publicly available at the Statlog project web page [18]. The characteristics of the databases are given in Table 1. As can be seen, we have chosen different types of databases, including some with a large number of predictor variables, some with a large number of cases, and some multi-class problems.
Table 1. Details of databases

Database      Number of cases   Number of classes   Number of attributes
Diabetes            768                 2                     8
Australian          690                 2                    14
Heart               270                 2                    13
Monk2               432                 2                     6
Wine                178                 3                    13
Zoo                 101                 7                    16
Waveform-21        5000                 3                    21
Nettalk           14471               324                   203
Letter            20000                26                    16
Shuttle           58000                 7                     9
In order to give a realistic view of the applied methods, we use 10-fold cross-validation [29] in all experiments. Each database has been randomly split into ten training sets with their corresponding test sets. The same validation files have always been used for both algorithms: ID3 and our approach, k-NN-boosting. Ten executions of k-NN-boosting have been carried out for every 10-fold set, one for each value of K ranging from 1 to 10. Table 2 compares the error rate of ID3 with the best and worst performance of k-NN-boosting, along with the average error rate over the ten values of K used in the experiment. The cases in which k-NN-boosting outperforms ID3 are shown in boldface. Note that in six out of ten databases the average over the ten executions of k-NN-boosting outperforms ID3, and in two of the remaining four cases the performance is similar.
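As a sketch of this evaluation protocol (under the same assumptions as the earlier fragment, and reusing the hypothetical knn_boosting_fit), the code below reuses identical folds for both algorithms and reports the ID3 error together with the best, worst, and average k-NN-boosting error over K = 1..10, mirroring the quantities in Table 2.

from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

def evaluate(X, y, n_splits=10):
    # Fix the folds once so ID3 and k-NN-boosting see identical splits.
    folds = list(KFold(n_splits=n_splits, shuffle=True, random_state=0).split(X))
    id3_errors = []
    boost_errors = {k: [] for k in range(1, 11)}
    for train, test in folds:
        # Baseline: a plain entropy-based tree as a stand-in for ID3.
        tree = DecisionTreeClassifier(criterion="entropy").fit(X[train], y[train])
        id3_errors.append((tree.predict(X[test]) != y[test]).mean())
        # One k-NN-boosting run per value of K on the same fold.
        for k in range(1, 11):
            model = knn_boosting_fit(X[train], y[train], k=k)
            boost_errors[k].append((model.predict(X[test]) != y[test]).mean())
    avg_by_k = {k: sum(v) / len(v) for k, v in boost_errors.items()}
    return (sum(id3_errors) / len(id3_errors),   # ID3 error rate
            min(avg_by_k.values()),              # best K
            max(avg_by_k.values()),              # worst K
            sum(avg_by_k.values()) / 10)         # average over K = 1..10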