Consolidated Trees: An Analysis of Structural Convergence
Jesús M. Pérez, Javier Muguerza, Olatz Arbelaitz,
Ibai Gurrutxaga, and José I. Martín
Dept. of Computer Architecture and Technology, University of the Basque Country,
M. Lardizabal, 1, 20018 Donostia, Spain
{txus.perez, j.muguerza, olatz.arbelaitz,
ibai.gurrutxaga, j.martin}@ehu.es
http://www.sc.ehu.es/aldapa
Abstract. When different subsamples of the same data set are used to induce
classification trees, the structures of the resulting classifiers can differ
widely. Structural stability of the tree is of paramount importance in many
domains, such as illness diagnosis, fraud detection in different fields, and
customer behaviour analysis (marketing), where comprehensibility of the
classifier is necessary. We have developed a methodology for building
classification trees from multiple samples in which the final classifier is a
single decision tree (Consolidated Trees). This paper presents an analysis of
the structural stability of our algorithm compared with the C4.5 algorithm.
The classification trees generated with our algorithm achieve smaller error
rates and are structurally steadier than those of C4.5 when resampling
techniques are used. The main focus of this paper is to show how Consolidated
Trees built from different sets of subsamples tend to converge to the same
tree as the number of subsamples used increases.
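To make the consolidation idea concrete, the following Python sketch grows a single tree from several subsamples: at each node every subsample proposes a split, the subsamples vote on the splitting feature, and the agreed split partitions all subsamples at once. The Gini criterion, the majority vote, the threshold pooling by median, and the stopping rule are simplifying assumptions chosen for illustration; they are not the exact CTC procedure analysed in this paper.

import numpy as np

def gini(y):
    # Gini impurity of an integer label vector.
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - float(np.sum(p ** 2))

def best_split(X, y):
    # Best (feature, threshold) pair on one subsample by weighted Gini.
    best_f, best_t, best_g = None, None, gini(y)
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:
            mask = X[:, f] <= t
            g = (mask.sum() * gini(y[mask])
                 + (~mask).sum() * gini(y[~mask])) / len(y)
            if g < best_g:
                best_f, best_t, best_g = f, t, g
    return best_f, best_t

def consolidated_tree(samples, depth=0, max_depth=4):
    # samples: list of (X, y) subsamples; grows ONE tree shared by all.
    all_y = np.concatenate([y for _, y in samples])
    if len(all_y) == 0:
        return {"leaf": 0}                        # degenerate empty node
    votes = [best_split(X, y) for X, y in samples if len(y) > 1]
    feats = [f for f, _ in votes if f is not None]
    if depth == max_depth or not feats:
        return {"leaf": int(np.bincount(all_y).argmax())}
    f = int(np.bincount(feats).argmax())          # subsamples vote on the feature
    t = float(np.median([tt for ff, tt in votes if ff == f]))  # pool thresholds
    left  = [(X[X[:, f] <= t], y[X[:, f] <= t]) for X, y in samples]
    right = [(X[X[:, f] >  t], y[X[:, f] >  t]) for X, y in samples]
    return {"feature": f, "threshold": t,
            "left":  consolidated_tree(left,  depth + 1, max_depth),
            "right": consolidated_tree(right, depth + 1, max_depth)}

def predict(tree, x):
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] <= tree["threshold"] else tree["right"]
    return tree["leaf"]

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)
subsamples = [(X[i], y[i]) for i in
              [rng.choice(300, size=200, replace=False) for _ in range(10)]]
tree = consolidated_tree(subsamples)
print(sum(predict(tree, x) == c for x, c in zip(X, y)) / len(y))

Because every subsample is forced through the same sequence of splits, the disagreement between independently grown trees is resolved during construction rather than averaged away afterwards.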
1 Introduction
Many examples of the use of resampling techniques, oversampling or
undersampling, with different objectives can be found in the literature. A
very important application of resampling is to balance the class distribution
in databases with class imbalance [12],[18]. In many areas, such as medicine,
fraud detection, etc., cases of one of the classes can be difficult to obtain.
This very often leads to class imbalance in the data set which, in general,
does not even coincide with the distribution expected in reality.
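As an illustration, the following Python sketch balances a skewed class distribution by random oversampling of the minority classes; the sampling scheme is a generic illustration, not the specific procedure of [12] or [18].

import numpy as np

def oversample(X, y, rng=None):
    # Replicate minority-class rows until every class matches the
    # majority-class count.
    rng = rng or np.random.default_rng(0)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    idx = []
    for c, n in zip(classes, counts):
        members = np.where(y == c)[0]
        idx.extend(members)                   # keep every original row
        if n < target:                        # draw the shortfall with replacement
            idx.extend(rng.choice(members, size=target - n, replace=True))
    idx = np.asarray(idx)
    return X[idx], y[idx]

# A 90/10 imbalance becomes 50/50 after oversampling.
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)
Xb, yb = oversample(X, y)
print(np.bincount(yb))   # [90 90]

Undersampling is the symmetric operation: the majority class is subsampled down to the minority-class count, trading information loss for a smaller balanced set.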
A similar case is that of databases with non-uniform costs, where the
misclassification cost is not the same across the whole confusion matrix. In
these cases, if the algorithm does not take the cost matrix into account
during induction, resampling techniques that make some errors weigh more than
others are a way of introducing such a cost into the learning algorithm [9].
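The following Python sketch shows one generic way of encoding a cost matrix through resampling, replicating the instances of each class in proportion to the total cost of misclassifying it so that a cost-blind learner sees costly classes more often; it is an illustrative scheme under these assumptions, not the method of [9].

import numpy as np

def cost_resample(X, y, cost, rng=None):
    # cost[i, j]: cost of predicting class j when the true class is i.
    # Classes are assumed to be labelled 0..k-1.
    rng = rng or np.random.default_rng(0)
    weights = cost.sum(axis=1)            # total cost of misclassifying each class
    weights = weights / weights.min()     # cheapest class keeps its original size
    idx = []
    for c in np.unique(y):
        members = np.where(y == c)[0]
        n = int(round(weights[c] * len(members)))   # replicate by relative cost
        idx.extend(rng.choice(members, size=n, replace=True))
    idx = np.asarray(idx)
    return X[idx], y[idx]

# Missing class 1 is three times as costly as missing class 0,
# so class 1 ends up three times over-represented.
cost = np.array([[0.0, 1.0],
                 [3.0, 0.0]])
X = np.arange(200).reshape(200, 1)
y = np.array([0] * 100 + [1] * 100)
Xc, yc = cost_resample(X, y, cost)
print(np.bincount(yc))   # [100 300]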
On the other hand, for some databases the use of machine learning algorithms
is computationally too expensive due to their memory requirements; in these
cases resampling can be used for size reduction [4],[16]. Nor can we forget
one of the most widespread uses of resampling techniques: the