domain into equal width intervals is called Equal Interval Width Method. Another
method of discretization, called Equal Frequency per Interval Method,
is based on dividing the domain of a numerical attribute into subsets with
approximately equal relative frequencies of attribute values. With this method,
the entropy of the discretized attribute is maximal. In our experiments we used
another discretization method, based on minimum entropy as a criterion to
evaluate a list of best cutpoints. This method was used together with the
original LEM2 algorithm as one of our three approaches for rule induction
from numerical data. The two other approaches were two different versions of
MLEM2: with and without merging intervals.
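
As a rough illustration of the two simple methods named above, the following Python sketch computes cutpoints for a single numerical attribute. The function names, the parameter k (the number of intervals) and the choice to place equal-frequency cutpoints midway between neighbouring sorted values are assumptions made for this example; they are not taken from the chapter.

```python
def equal_width_cutpoints(values, k):
    """Return k-1 cutpoints splitting [min, max] into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]


def equal_frequency_cutpoints(values, k):
    """Return up to k-1 cutpoints so each interval holds roughly len(values)/k cases."""
    ordered = sorted(values)
    n = len(ordered)
    cutpoints = []
    for i in range(1, k):
        idx = i * n // k
        if idx == 0:
            continue  # fewer cases than intervals: no valid split position here
        left, right = ordered[idx - 1], ordered[idx]
        if left == right:
            continue  # equal values cannot be separated by a cutpoint
        q = (left + right) / 2  # assumption: cut midway between neighbours
        if q not in cutpoints:
            cutpoints.append(q)
    return cutpoints


values = [1, 2, 3, 4, 10, 20, 30, 40]
width_cuts = equal_width_cutpoints(values, 4)
frequency_cuts = equal_frequency_cutpoints(values, 4)
```

On skewed data such as the sample list, the equal-width cutpoints leave most cases in the first interval, while the equal-frequency cutpoints put two cases in each interval.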
Our main objective was to compare the performance of these three approaches
for rule induction from numerical data. As expected, the newest version
of MLEM2 produces the smallest total number of conditions in rule sets.
However, performance in terms of accuracy is approximately the same for
all three approaches. In [8], using data sets different from those reported in
this paper, MLEM2 with merging intervals was compared with two other approaches
to rule induction from numerical data: discretization based on agglomerative and
divisive cluster analysis, each followed by LEM2.
Note that results of experiments comparing the quality of rule sets, induced
by ID3, in terms of accuracy, with and without dropping conditions, were
published in [10]. A preliminary version of this chapter was presented at the
Workshop on Foundations of Semantic Oriented Data and WEB Mining, in
conjunction with the ICDM'05, Fifth IEEE International Conference on Data
Mining, Houston, TX, November 27-30, 2005.
2 Discretization Algorithm Based on Minimum Entropy
The discretization method, based on minimum entropy, was suggested in [3],
and is also called Minimal Class Entropy Method. A similar process of
discretization is used in C4.5 [15].
We will present our own discretization algorithm, introduced in [2].
Discretization is a conversion of domains of numerical attributes into intervals [6].
Such intervals are defined by cutpoints, numbers limiting the intervals. Let U
denote the set of all cases of a data set. A single cutpoint q,
a number from the domain of a numerical attribute a, defines two intervals,
containing two subsets S_1 and S_2 of U. For an attribute a, the conditional
entropy of a cutpoint q is
\[ E(a, U, q) = \frac{|S_1|}{|U|} E(S_1) + \frac{|S_2|}{|U|} E(S_2), \]
where E(S) is the entropy of a subset S of U. The entropy E(S) is computed
in the standard way as
\[ E(S) = -\sum_{j=1}^{n} p_j \log p_j, \]
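
The two formulas above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the algorithm from [2]: the logarithm is taken to base 2, p_j is taken to be the relative frequency of the j-th decision value in S, and S_1 and S_2 are taken to be the cases with attribute values below q and at or above q, respectively. The function and variable names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """E(S) = -sum_j p_j * log2(p_j), with p_j the relative frequency of label j."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def cutpoint_entropy(values, labels, q):
    """E(a, U, q) = |S1|/|U| * E(S1) + |S2|/|U| * E(S2) for one attribute."""
    s1 = [d for v, d in zip(values, labels) if v < q]   # assumption: S1 is v < q
    s2 = [d for v, d in zip(values, labels) if v >= q]  # assumption: S2 is v >= q
    n = len(labels)
    return len(s1) / n * entropy(s1) + len(s2) / n * entropy(s2)


# Hypothetical toy data: attribute values and the decision of each case.
values = [1, 2, 3, 4]
labels = ["a", "a", "b", "b"]

# Candidate cutpoints: midpoints between consecutive distinct sorted values.
distinct = sorted(set(values))
candidates = [(x + y) / 2 for x, y in zip(distinct, distinct[1:])]

# Under the minimum entropy criterion, a lower value marks a better cutpoint.
best = min(candidates, key=lambda q: cutpoint_entropy(values, labels, q))
```

For this toy data the candidate q = 2.5 separates the two decision values completely, so its conditional entropy is zero and it is the one selected.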