construction of multiple classifiers, such as bagging, boosting, etc., which are able to achieve higher classification accuracy [1],[3],[6],[10].
In all the mentioned cases, subsamples obtained by resampling the original data set are given to the learning algorithm in order to build a classifier. This resampling severely affects the behaviour of classification algorithms [12]. Classification trees are no exception: trees induced from slightly different subsamples of the same data set can differ widely in accuracy and structure [8]. This weakness is called unsteadiness or instability. Stability is of capital importance in many domains, such as illness diagnosis, fraud detection, and customer behaviour analysis (marketing), where comprehensibility of the classifier is necessary [7]. As Turney found when working on industrial applications of decision tree learning, “the engineers are disturbed when different batches of data from the same process result in radically different decision trees. The engineers lose confidence in the decision trees even when we can demonstrate that the trees have high predictive accuracy” [17]. Some authors [7],[17] have measured the stability of a classifier by observing whether different instances of it agree in the prediction made for each case of the test set (logical stability or variance). However, since the explanation of a tree comes from its structure, we need a way of building structurally steady classifiers in order to obtain a convincing explanation (physical or structural stability).
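The prediction-agreement notion of logical stability mentioned above can be sketched as a small function (a minimal illustration; the function and classifier names are hypothetical, assuming each classifier instance is a callable returning a class label):

```python
def logical_stability(clf_a, clf_b, test_cases):
    """Fraction of test cases on which two classifier instances
    make the same prediction (prediction agreement)."""
    agreements = sum(1 for x in test_cases if clf_a(x) == clf_b(x))
    return agreements / len(test_cases)

# Two toy "classifiers" induced from different subsamples,
# differing only in the learned threshold:
clf_1 = lambda x: "pos" if x > 5 else "neg"
clf_2 = lambda x: "pos" if x > 7 else "neg"

cases = [2, 4, 6, 8, 10]
print(logical_stability(clf_1, clf_2, cases))  # → 0.8
```

Two instances may agree on most predictions (high logical stability) and yet have very different internal structures, which is why a structural notion of stability is needed as well.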
This paper presents an analysis of the structural stability of decision trees built using the Consolidated Trees' Construction (CTC) algorithm. In contrast to other algorithms that work with many subsamples (bagging, boosting), CTC induces a single tree, and therefore does not lose the comprehensibility of the base classifier. A measure of similarity between two induced classifiers (tree structures) is used to evaluate the structural stability of the algorithm. The structural analysis shows that the algorithm behaves more steadily than C4.5 [15], thereby providing a steadier explanation. The main focus of this paper is on showing how the trees built with the proposed algorithm tend to become more similar as the number of subsamples used to build them increases. In some domains, they converge to the very same tree even when different subsamples are used.
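The concrete similarity measure is presented later in the paper; purely as an illustration of comparing two tree structures, one simple notion counts the positions in a node listing where both trees split on the same attribute (the encoding and function name here are assumptions for the sketch, not the paper's definition):

```python
def structural_similarity(tree_a, tree_b):
    """Fraction of corresponding positions (e.g. a breadth-first
    node listing) where both trees split on the same attribute.
    Each tree is encoded simply as a list of split-attribute names."""
    if not tree_a or not tree_b:
        return 0.0
    common = sum(1 for a, b in zip(tree_a, tree_b) if a == b)
    return common / max(len(tree_a), len(tree_b))

# Two trees that agree on their first two splits but then diverge:
t1 = ["age", "income", "height"]
t2 = ["age", "income", "weight", "city"]
print(structural_similarity(t1, t2))  # → 0.5
```

Identical trees score 1.0 under such a measure, so convergence of different runs toward the same tree shows up as similarity values approaching 1.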
The discriminating capacity of the CTC algorithm has already been evaluated in previous works [13],[14], which show that classification trees generated with CTC achieve smaller error rates than those built with C4.5, thus improving the quality of the explanation.
The paper proceeds as follows. Section 2 describes our methodology for building classification trees. Section 3 presents the data set and the experimental set-up. Section 4 summarises the results of our previous work. Section 5 presents the analysis of the structural stability and the convergence of the structure of trees built with the CTC algorithm. Finally, Section 6 summarises the conclusions and further work.
2 Consolidated Trees' Construction Algorithm
The Consolidated Trees' Construction (CTC) algorithm uses several subsamples to build a single tree. This technique is radically different from bagging, boosting, etc.: the consensus is achieved at each step of the tree-building process, and only one tree is built.
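The per-step consensus can be sketched as follows (a minimal sketch based only on the description above; the voting scheme, names, and data layout are assumptions, not the algorithm's exact specification): at each node, every subsample proposes the split it would make, the majority proposal wins, and that single split is then forced on all subsamples, so all of them grow the same tree.

```python
from collections import Counter

def consensus_split(subsamples, propose_split):
    """One node of a consolidated tree: each subsample votes for
    the attribute it would split on; the majority attribute is
    then forced on every subsample, so only one tree is grown."""
    votes = Counter(propose_split(s) for s in subsamples)
    attribute, _ = votes.most_common(1)[0]
    return attribute

# Toy example: each "subsample" carries the split its own
# local split criterion would have chosen.
subsamples = [{"best": "age"}, {"best": "income"}, {"best": "age"}]
print(consensus_split(subsamples, lambda s: s["best"]))  # → age
```

Because every subsample descends through the same agreed split, the procedure ends with a single comprehensible tree rather than an ensemble of divergent ones.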