construction of multiple classifiers, such as bagging, boosting, etc., which are able to achieve higher classification accuracy [1],[3],[6],[10].
In all the mentioned cases, subsamples obtained by resampling the original data set are given to the learning algorithm in order to build a classifier. This resampling severely affects the behaviour of classification algorithms [12]. Classification trees are no exception: trees induced from slightly different subsamples of the same data set can differ widely in accuracy and structure [8]. This weakness is called unsteadiness or instability. Stability is of capital importance in many domains, such as illness diagnosis, fraud detection, and customer behaviour analysis (marketing), where comprehensibility of the classifier is necessary [7]. As Turney found when working on industrial applications of decision tree learning, “the engineers are disturbed when different batches of data from the same process result in radically different decision trees. The engineers lose confidence in the decision trees even when we can demonstrate that the trees have high predictive accuracy” [17]. Some authors [7],[17] have measured the stability of a classifier by observing whether different instances of it agree in the prediction made for each case of the test set (logical stability or variance). However, since the explanation of a tree comes from its structure, we need a way of building structurally steady classifiers in order to obtain a convincing explanation (physical or structural stability).
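The prediction-agreement notion of logical stability mentioned above can be sketched as a small function (a minimal illustration; the function and classifier names are hypothetical, assuming each classifier instance is a callable returning a class label):

```python
def logical_stability(clf_a, clf_b, test_cases):
    """Fraction of test cases on which two classifier instances
    make the same prediction (prediction agreement)."""
    agreements = sum(1 for x in test_cases if clf_a(x) == clf_b(x))
    return agreements / len(test_cases)

# Two toy "classifiers" induced from different subsamples,
# differing only in the learned threshold:
clf_1 = lambda x: "pos" if x > 5 else "neg"
clf_2 = lambda x: "pos" if x > 7 else "neg"

cases = [2, 4, 6, 8, 10]
print(logical_stability(clf_1, clf_2, cases))  # → 0.8
```

Two instances may agree on most predictions (high logical stability) and yet have very different internal structures, which is why a structural notion of stability is needed as well.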
This paper presents an analysis of the structural stability of decision trees built using the Consolidated Trees' Construction (CTC) algorithm. In contrast to other algorithms that work with many subsamples (bagging, boosting), CTC induces a single tree, and therefore does not lose the comprehensibility of the base classifier. A measure of similarity between two induced classifiers (tree structures) is used to evaluate the structural stability of the algorithm. The structural analysis shows that the algorithm behaves more steadily than C4.5 [15], thereby providing a steadier explanation. The main focus of this paper is on showing how the trees built with the proposed algorithm tend to become more similar as the number of subsamples used to build them increases. In some domains, they converge to the very same tree even when different subsamples are used.
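The concrete similarity measure is presented later in the paper; purely as an illustration of comparing two tree structures, one simple notion counts the positions in a node listing where both trees split on the same attribute (the encoding and function name here are assumptions for the sketch, not the paper's definition):

```python
def structural_similarity(tree_a, tree_b):
    """Fraction of corresponding positions (e.g. a breadth-first
    node listing) where both trees split on the same attribute.
    Each tree is encoded simply as a list of split-attribute names."""
    if not tree_a or not tree_b:
        return 0.0
    common = sum(1 for a, b in zip(tree_a, tree_b) if a == b)
    return common / max(len(tree_a), len(tree_b))

# Two trees that agree on their first two splits but then diverge:
t1 = ["age", "income", "height"]
t2 = ["age", "income", "weight", "city"]
print(structural_similarity(t1, t2))  # → 0.5
```

Identical trees score 1.0 under such a measure, so convergence of different runs toward the same tree shows up as similarity values approaching 1.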
The discriminating capacity of the CTC algorithm has already been evaluated in previous works [13],[14], which show that classification trees generated with CTC achieve smaller error rates than those built with C4.5, thus improving the quality of the explanation.
The paper proceeds as follows. Section 2 describes our methodology for building classification trees. Section 3 presents the data set and the experimental set-up. Section 4 summarises the results of our previous work. Section 5 presents the analysis of the structural stability and the convergence of the structure of trees built with the CTC algorithm. Finally, Section 6 summarises the conclusions and further work.
2 Consolidated Trees' Construction Algorithm
The Consolidated Trees' Construction (CTC) algorithm uses several subsamples to build a single tree. This technique is radically different from bagging, boosting, etc.: the consensus is achieved at each step of the tree-building process, and only one tree is built.
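The per-step consensus can be sketched as follows (a minimal sketch based only on the description above; the voting scheme, names, and data layout are assumptions, not the algorithm's exact specification): at each node, every subsample proposes the split it would make, the majority proposal wins, and that single split is then forced on all subsamples, so all of them grow the same tree.

```python
from collections import Counter

def consensus_split(subsamples, propose_split):
    """One node of a consolidated tree: each subsample votes for
    the attribute it would split on; the majority attribute is
    then forced on every subsample, so only one tree is grown."""
    votes = Counter(propose_split(s) for s in subsamples)
    attribute, _ = votes.most_common(1)[0]
    return attribute

# Toy example: each "subsample" carries the split its own
# local split criterion would have chosen.
subsamples = [{"best": "age"}, {"best": "income"}, {"best": "age"}]
print(consensus_split(subsamples, lambda s: s["best"]))  # → age
```

Because every subsample descends through the same agreed split, the procedure ends with a single comprehensible tree rather than an ensemble of divergent ones.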