discoveries can be made accessible to human review [Hunter and Klein (1993)].
Comprehensibility can vary between different classifiers created by the
same inducer. For instance, in the case of decision trees, the size (number
of nodes) of the induced trees is also important. Smaller trees are preferred
because they are easier to interpret. There are also other reasons for
preferring smaller decision trees. According to a fundamental principle in
science, known as Occam's razor, when searching for the explanation of any
phenomenon one should make as few assumptions as possible, eliminating those
that make no difference in the observable predictions of the explanatory
hypothesis. The implication for decision trees is that the smallest decision
tree consistent with the training set is the one most likely to classify
unseen instances correctly. However, this is only a rule of thumb; in some
pathological cases a large and unbalanced tree can still be easily interpreted
[Buja and Lee (2001)]. Moreover, the problem of finding the smallest
consistent tree is known to be NP-complete [Murphy and McCraw (1991)].
As the reader can see, the accuracy and complexity factors can be
quantitatively estimated, whereas comprehensibility is more subjective.
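The trade-off above can be made concrete with a small sketch. The toy data, attribute names, and the nested-tuple tree representation below are all hypothetical, chosen only to show two trees that are both consistent with the same training set while differing in size; by the Occam's razor argument, the smaller tree is preferred.

```python
# A decision tree is encoded as (feature_index, {value: subtree});
# a leaf is simply a class label (a string).

def classify(tree, instance):
    """Follow branches until a leaf label is reached."""
    if not isinstance(tree, tuple):          # leaf node
        return tree
    feature, branches = tree
    return classify(branches[instance[feature]], instance)

def node_count(tree):
    """Size of the tree: internal nodes plus leaves."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(node_count(sub) for sub in tree[1].values())

# Hypothetical training set: the label simply equals attribute 0.
training = [(("yes", "hot"), "yes"), (("yes", "cold"), "yes"),
            (("no", "hot"), "no"),   (("no", "cold"), "no")]

# Small tree: tests only attribute 0 (1 internal node, 2 leaves).
small = (0, {"yes": "yes", "no": "no"})

# Larger tree: needlessly tests attribute 1 first, then attribute 0.
large = (1, {"hot":  (0, {"yes": "yes", "no": "no"}),
             "cold": (0, {"yes": "yes", "no": "no"})})

# Both trees are consistent with the training set ...
for tree in (small, large):
    assert all(classify(tree, x) == y for x, y in training)

# ... but they differ in size: 3 nodes versus 7 nodes.
print(node_count(small), node_count(large))  # -> 3 7
```

Accuracy on the training set and node count are both measured mechanically here; judging which tree a human finds easier to read remains, as noted above, a subjective matter.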
4.5 Scalability to Large Datasets
Scalability refers to the ability of the method to construct the classification
model efficiently given large amounts of data. Classical induction algorithms
have been applied with practical success to many relatively simple and
small-scale problems. However, trying to discover knowledge in real-life,
large databases introduces time and memory problems.
As large databases have become the norm in many fields (including
astronomy, molecular biology, finance, marketing, health care, and many
others), the use of data mining to discover patterns in them has become a
potentially very productive enterprise. Many companies are staking a large
part of their future on these “data mining” applications and are looking to
the research community for solutions to the fundamental problems they
encounter.
While an abundance of available data was once every data analyst's dream,
nowadays the synonym for “very large” has become “terabyte”, a hardly
imaginable volume of information. Information-intensive organizations (such
as telecom companies and banks) are expected to accumulate several terabytes
of raw data every one to two years.