Graphics Reference
In-Depth Information
continuous values are placed in each bin. Thus, the width of each interval is computed
by dividing the length of the attribute range by the desired arity.
There is no need for a stopping criterion as the number of bins is computed directly
at the beginning of the process. In practice, both methods are equivalent as they only
depend on the number of bins desired, regardless of the calculation method.
MDLP [ 41 ]
This discretizer uses the entropy measure to evaluate candidate cut points. Entropy is
one of the most commonly used discretization measures in the literature. The entropy
of a sample variable X is
H
(
X
) =−
p x log p x
x
where x represents a value of X and p x its estimated probability of occurring. It
corresponds to the average amount of information per event where information of an
event is defines as:
I
(
x
) =−
log p x
Information is high for lower probable events and low otherwise. This discretizer
uses the Information Gain of a cut point, which is defined as
) |
S 1 |
N
S 1 ) |
S 2 |
N
(
,
;
) =
(
)
(
,
;
) =
(
(
(
S 2 )
G
A
T
S
H
S
H
A
T
S
H
S
H
H
where A is the attribute in question, T is a candidate cut point and S is the set of N
examples. So, S i is a partitioned subset of examples produced by T .
The MDLP discretizer applies the Minimum Description Length Principle to
decide the acceptation or rejection for each cut point and to govern the stopping
criterion. It is defined in information theory to be the minimum number of bits
required to uniquely specify an object out of the universe of all objects. It computes
a final cost of coding and takes part in the decision making of the discretizer.
In summary, the MDLP criterion is that the partition induced by a cut point T for
aset S of N examples is accepted iff
log 2 (
N
1
)
+ δ(
A
,
T
;
S
)
G
(
A
,
T
;
S
)>
N
N
3 c
where
. We recall
that c stands for the number of classes of the data set in supervised learning.
The stopping criterion is given in the MDLP itself, due to the fact that
δ(
A
,
T
;
S
) =
log 2 (
2
) −[
c
·
H
(
S
)
c 1 ·
H
(
S 1 )
c 2 ·
H
(
S 2 ) ]
δ(
A
,
T
;
S
)
acts as a threshold to stop accepting new partitions.
Distance [ 20 ]
This method introduces a distance measure called Mantaras Distance to evaluate the
cut points. Let us consider two partitions S a and S b on a range of continuous values,
 
Search WWH ::




Custom Search