discoveries can be made accessible to human review [Hunter and Klein (1993)].
Comprehensibility can vary between different classifiers created by the
same inducer. For instance, in the case of decision trees, the size (number
of nodes) of the induced trees is also important. Smaller trees are preferred
because they are easier to interpret. There are also other reasons for
preferring smaller decision trees. According to a fundamental principle in
science, known as Occam's razor, when searching for the explanation of any
phenomenon one should make as few assumptions as possible, eliminating those
that make no difference in the observable predictions of the explanatory
hypothesis. The implication for decision trees is that the smallest decision
tree consistent with the training set is the one most likely to classify
unseen instances correctly. However, this is only a rule of thumb; in some
pathological cases a large and unbalanced tree can still be easily interpreted
[Buja and Lee (2001)]. Moreover, the problem of finding the smallest
consistent tree is known to be NP-complete [Murphy and McCraw (1991)].
As the reader can see, the accuracy and complexity factors can be
quantitatively estimated, whereas comprehensibility is more subjective.
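The trade-off above can be made concrete with a small sketch. The toy data, attribute names, and the nested-tuple tree representation below are all hypothetical, chosen only to show two trees that are both consistent with the same training set while differing in size; by the Occam's razor argument, the smaller tree is preferred.

```python
# A decision tree is encoded as (feature_index, {value: subtree});
# a leaf is simply a class label (a string).

def classify(tree, instance):
    """Follow branches until a leaf label is reached."""
    if not isinstance(tree, tuple):          # leaf node
        return tree
    feature, branches = tree
    return classify(branches[instance[feature]], instance)

def node_count(tree):
    """Size of the tree: internal nodes plus leaves."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(node_count(sub) for sub in tree[1].values())

# Hypothetical training set: the label simply equals attribute 0.
training = [(("yes", "hot"), "yes"), (("yes", "cold"), "yes"),
            (("no", "hot"), "no"),   (("no", "cold"), "no")]

# Small tree: tests only attribute 0 (1 internal node, 2 leaves).
small = (0, {"yes": "yes", "no": "no"})

# Larger tree: needlessly tests attribute 1 first, then attribute 0.
large = (1, {"hot":  (0, {"yes": "yes", "no": "no"}),
             "cold": (0, {"yes": "yes", "no": "no"})})

# Both trees are consistent with the training set ...
for tree in (small, large):
    assert all(classify(tree, x) == y for x, y in training)

# ... but they differ in size: 3 nodes versus 7 nodes.
print(node_count(small), node_count(large))  # -> 3 7
```

Accuracy on the training set and node count are both measured mechanically here; judging which tree a human finds easier to read remains, as noted above, a subjective matter.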
4.5 Scalability to Large Datasets
Scalability refers to the ability of the method to construct the classification
model efficiently given large amounts of data. Classical induction algorithms
have been applied with practical success to many relatively simple and
small-scale problems. However, trying to discover knowledge in real-life,
large databases introduces time and memory problems.
As large databases have become the norm in many fields (including
astronomy, molecular biology, finance, marketing, health care, and many
others), the use of data mining to discover patterns in them has become a
potentially very productive enterprise. Many companies are staking a large
part of their future on these “data mining” applications and are looking to
the research community for solutions to the fundamental problems they
encounter.
While an abundance of available data was once every data analyst's dream,
nowadays the synonym for “very large” has become “terabyte”, a hardly
imaginable volume of information. Information-intensive organizations (such
as telecom companies and banks) are expected to accumulate several terabytes
of raw data every one to two years.