Discretization - Data Preprocessing in Data Mining

Graphics Reference

In-Depth Information

continuous values are placed in each bin. Thus, the width of each interval is computed

by dividing the length of the attribute range by the desired arity.

There is no need for a stopping criterion as the number of bins is computed directly

at the beginning of the process. In practice, both methods are equivalent as they only

depend on the number of bins desired, regardless of the calculation method.

MDLP [ 41 ]

This discretizer uses the entropy measure to evaluate candidate cut points. Entropy is

one of the most commonly used discretization measures in the literature. The entropy

of a sample variable X is

(

) =−

p x log p x

where x represents a value of X and p x its estimated probability of occurring. It

corresponds to the average amount of information per event where information of an

event is defines as:

(

) =−

log p x

Information is high for lower probable events and low otherwise. This discretizer

uses the Information Gain of a cut point, which is defined as

) − |

S 1 |

S 1 ) − |

S 2 |

(

;

) =

(

) −

(

;

) =

(

S 2 )

where A is the attribute in question, T is a candidate cut point and S is the set of N

examples. So, S i is a partitioned subset of examples produced by T .

The MDLP discretizer applies the Minimum Description Length Principle to

decide the acceptation or rejection for each cut point and to govern the stopping

criterion. It is defined in information theory to be the minimum number of bits

required to uniquely specify an object out of the universe of all objects. It computes

a final cost of coding and takes part in the decision making of the discretizer.

In summary, the MDLP criterion is that the partition induced by a cut point T for

aset S of N examples is accepted iff

log 2 (

−

)

+ δ(

;

)

(

;

3 c

where

. We recall

that c stands for the number of classes of the data set in supervised learning.

The stopping criterion is given in the MDLP itself, due to the fact that

δ(

;

) =

log 2 (

−

) −[

(

) −

c 1 ·

(

S 1 ) −

c 2 ·

(

S 2 ) ]

δ(

;

)

acts as a threshold to stop accepting new partitions.

Distance [ 20 ]

This method introduces a distance measure called Mantaras Distance to evaluate the

cut points. Let us consider two partitions S a and S b on a range of continuous values,

Data Preprocessing in Data Mining

Search WWH ::

Custom Search

Home