Consolidated Trees: An Analysis of Structural Convergence
Jesús M. Pérez, Javier Muguerza, Olatz Arbelaitz,
Ibai Gurrutxaga, and José I. Martín
Dept. of Computer Architecture and Technology, University of the Basque Country,
M. Lardizabal, 1, 20018 Donostia, Spain
{txus.perez, j.muguerza, olatz.arbelaitz,
ibai.gurrutxaga, j.martin}@ehu.es
http://www.sc.ehu.es/aldapa
Abstract. When different subsamples of the same data set are used to induce
classification trees, the structures of the resulting classifiers can differ
widely. Structural stability of the tree is of paramount importance in many
domains, such as illness diagnosis, fraud detection in different fields, and
customer behaviour analysis (marketing), where comprehensibility of the
classifier is necessary. We have developed a methodology for building
classification trees from multiple samples in which the final classifier is a
single decision tree (Consolidated Trees). This paper presents an analysis of
the structural stability of our algorithm compared with the C4.5 algorithm.
The classification trees generated with our algorithm achieve smaller error
rates and are structurally steadier than those of C4.5 when resampling
techniques are used. The main focus of this paper is to show how Consolidated
Trees built from different sets of subsamples tend to converge to the same
tree as the number of subsamples used increases.
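To make the consolidation idea concrete, the following Python sketch grows a single tree from several subsamples: at each node every subsample proposes a split, the subsamples vote on the splitting feature, and the agreed split partitions all subsamples at once. The Gini criterion, the majority vote, the threshold pooling by median, and the stopping rule are simplifying assumptions chosen for illustration; they are not the exact CTC procedure analysed in this paper.

import numpy as np

def gini(y):
    # Gini impurity of an integer label vector.
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - float(np.sum(p ** 2))

def best_split(X, y):
    # Best (feature, threshold) pair on one subsample by weighted Gini.
    best_f, best_t, best_g = None, None, gini(y)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:
            mask = X[:, f] <= t
            g = (mask.sum() * gini(y[mask])
                 + (~mask).sum() * gini(y[~mask])) / len(y)
            if g < best_g:
                best_f, best_t, best_g = f, t, g
    return best_f, best_t

def consolidated_tree(samples, depth=0, max_depth=4):
    # samples: list of (X, y) subsamples; grows ONE tree shared by all.
    all_y = np.concatenate([y for _, y in samples])
    if len(all_y) == 0:
        return {"leaf": 0}                        # degenerate empty node
    votes = [best_split(X, y) for X, y in samples if len(y) > 1]
    feats = [f for f, _ in votes if f is not None]
    if depth == max_depth or not feats:
        return {"leaf": int(np.bincount(all_y).argmax())}
    f = int(np.bincount(feats).argmax())          # subsamples vote on the feature
    t = float(np.median([tt for ff, tt in votes if ff == f]))  # pool thresholds
    left  = [(X[X[:, f] <= t], y[X[:, f] <= t]) for X, y in samples]
    right = [(X[X[:, f] >  t], y[X[:, f] >  t]) for X, y in samples]
    return {"feature": f, "threshold": t,
            "left":  consolidated_tree(left,  depth + 1, max_depth),
            "right": consolidated_tree(right, depth + 1, max_depth)}

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)
subsamples = [(X[i], y[i]) for i in
              [rng.choice(300, size=200, replace=False) for _ in range(10)]]
tree = consolidated_tree(subsamples)
print(sum(predict(tree, x) == c for x, c in zip(X, y)) / len(y))

Because every subsample is forced through the same sequence of splits, the disagreement between independently grown trees is resolved during construction rather than averaged away afterwards.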
1 Introduction
Many examples of the use of resampling techniques, oversampling or
undersampling, with different objectives can be found in the literature. A
very important application of resampling is to balance the class distribution
in databases with class imbalance [12],[18]. In many areas, such as medicine,
fraud detection, etc., cases of one of the classes can be difficult to obtain.
This very often leads to class imbalance in the data set which, in general,
does not even coincide with the distribution expected in reality.
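As an illustration, the following Python sketch balances a skewed class distribution by random oversampling of the minority classes; the sampling scheme is a generic illustration, not the specific procedure of [12] or [18].

import numpy as np

def oversample(X, y, rng=None):
    # Replicate minority-class rows until every class matches the
    # majority-class count.
    rng = rng or np.random.default_rng(0)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.where(y == c)[0]
        idx.extend(members)                   # keep every original row
        if n < target:                        # draw the shortfall with replacement
            idx.extend(rng.choice(members, size=target - n, replace=True))
    idx = np.asarray(idx)
    return X[idx], y[idx]

# A 90/10 imbalance becomes 50/50 after oversampling.
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)
Xb, yb = oversample(X, y)
print(np.bincount(yb))   # [90 90]

Undersampling is the symmetric operation: the majority class is subsampled down to the minority-class count, trading information loss for a smaller balanced set.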
A similar case is that of databases with non-uniform costs, where the
misclassification cost is not the same across the whole confusion matrix. In
these cases, if the algorithm does not take the cost matrix into account
during induction, resampling techniques that make some errors weigh more than
others are a way of introducing such a cost into the learning algorithm [9].
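The following Python sketch shows one generic way of encoding a cost matrix through resampling, replicating the instances of each class in proportion to the total cost of misclassifying it so that a cost-blind learner sees costly classes more often; it is an illustrative scheme under these assumptions, not the method of [9].

import numpy as np

def cost_resample(X, y, cost, rng=None):
    # cost[i, j]: cost of predicting class j when the true class is i.
    # Classes are assumed to be labelled 0..k-1.
    rng = rng or np.random.default_rng(0)
    weights = cost.sum(axis=1)            # total cost of misclassifying each class
    weights = weights / weights.min()     # cheapest class keeps its original size
    idx = []
    for c in np.unique(y):
        members = np.where(y == c)[0]
        n = int(round(weights[c] * len(members)))   # replicate by relative cost
        idx.extend(rng.choice(members, size=n, replace=True))
    idx = np.asarray(idx)
    return X[idx], y[idx]

# Missing class 1 is three times as costly as missing class 0,
# so class 1 ends up three times over-represented.
cost = np.array([[0.0, 1.0],
                 [3.0, 0.0]])
X = np.arange(200).reshape(200, 1)
y = np.array([0] * 100 + [1] * 100)
Xc, yc = cost_resample(X, y, cost)
print(np.bincount(yc))   # [100 300]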
On the other hand, for some databases the use of machine learning algorithms
is computationally too expensive due to their memory requirements; in these
cases resampling can be used for size reduction [4],[16]. Nor can we forget
one of the most widespread uses of resampling techniques: the