where:
  c    = number of classes
  m    = number of intervals
  N_ij = number of distinct values in the i-th interval, j-th class
  R_i  = number of examples in the i-th interval = Σ_{j=1}^{c} N_ij
  C_j  = number of examples in the j-th class = Σ_{i=1}^{m} N_ij
  N    = total number of examples = Σ_{j=1}^{c} C_j
  E_ij = expected frequency of N_ij = (R_i × C_j) / N
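Using the definitions above, the χ² value for a pair of adjacent intervals can be computed directly from the 2 × c contingency table of class counts. The following sketch is illustrative only; the function name and data layout are assumptions, not part of the original description.

```python
# A minimal sketch of the chi-square computation for two adjacent
# intervals, following the definitions above. The 2 x c table `n`
# holds N_ij: row i = interval, column j = class. Illustrative names.

def chi2_statistic(n):
    """Chi-square value for a pair of adjacent intervals.

    n: list of two rows, each with c class counts (N_ij).
    """
    m = len(n)                       # number of intervals (2 here)
    c = len(n[0])                    # number of classes
    R = [sum(row) for row in n]      # R_i: examples in interval i
    C = [sum(n[i][j] for i in range(m)) for j in range(c)]  # C_j
    N = sum(R)                       # total number of examples
    chi2 = 0.0
    for i in range(m):
        for j in range(c):
            E = R[i] * C[j] / N      # expected frequency E_ij
            if E > 0:                # skip empty classes
                chi2 += (n[i][j] - E) ** 2 / E
    return chi2

# Identical class distributions give 0 (good candidates for merging):
print(chi2_statistic([[4, 2], [4, 2]]))  # -> 0.0
```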
ChiMerge is a supervised, bottom-up discretizer. At the beginning, each distinct value of the attribute is considered to be one interval. χ² tests are performed for every pair of adjacent intervals, and the pair of adjacent intervals with the lowest χ² value is merged. Merging continues until the chosen stopping criterion is satisfied. The significance level for χ² is an input parameter that determines the threshold for the stopping criterion. Another parameter, called max-interval, can be included to prevent an excessive number of intervals from being created. The recommended value for the significance level lies in the range from 0.90 to 0.99, and the max-interval parameter should be set to 10 or 15.
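The merging loop just described can be sketched as follows. This is a minimal illustration assuming intervals are represented by their per-class count rows; the function names and the way the threshold is supplied are assumptions, not part of the original description.

```python
# A sketch of the ChiMerge loop. Intervals are per-class count rows
# (N_ij); names and parameter handling are illustrative.

def chi2_pair(r1, r2):
    """Chi-square statistic for two adjacent intervals (a 2 x c table)."""
    C = [a + b for a, b in zip(r1, r2)]   # class totals C_j
    N = sum(C)                            # total examples
    chi2 = 0.0
    for row in (r1, r2):
        R = sum(row)                      # interval total R_i
        for j, n_ij in enumerate(row):
            E = R * C[j] / N              # expected frequency E_ij
            if E > 0:
                chi2 += (n_ij - E) ** 2 / E
    return chi2

def chi_merge(intervals, threshold, max_interval=15):
    """Merge adjacent intervals until the lowest chi-square exceeds
    `threshold`, while respecting the max-interval budget."""
    intervals = [list(row) for row in intervals]
    while len(intervals) > 1:
        scores = [chi2_pair(intervals[i], intervals[i + 1])
                  for i in range(len(intervals) - 1)]
        i = min(range(len(scores)), key=scores.__getitem__)
        # stop once the best candidate pair is significant enough,
        # unless we are still over the max-interval budget
        if scores[i] > threshold and len(intervals) <= max_interval:
            break
        # merge the two adjacent intervals by summing class counts
        intervals[i:i + 2] = [[a + b for a, b in
                               zip(intervals[i], intervals[i + 1])]]
    return intervals
```

For two classes (one degree of freedom) and a significance level of 0.90, the threshold would be the χ² critical value 2.706.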
Chi2 [76]
Chi2 can be described as an automated version of ChiMerge. Here, the statistical significance level keeps changing to merge more and more adjacent intervals as long as an inconsistency criterion is satisfied, where an inconsistency is understood as two instances that match on their attribute values but belong to different classes. It is even possible to completely remove an attribute when no inconsistency appears during its discretization, so the method also acts as a feature selector. Like ChiMerge, the χ² statistic is used to discretize the continuous attributes until some inconsistencies are found in the data.
The stopping criterion is reached when inconsistencies appear in the data, considering a limit of zero or a δ inconsistency level as default.
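The inconsistency criterion can be sketched as a simple count: within each group of instances whose (discretized) attribute patterns match, every instance outside the majority class is inconsistent. The function name and data layout below are assumptions for illustration.

```python
# A sketch of the inconsistency rate driving Chi2's stopping
# criterion: instances that match on all discretized attribute
# values but carry different class labels. Illustrative names.

from collections import Counter, defaultdict

def inconsistency_rate(rows, labels):
    """rows: discretized attribute vectors; labels: class labels."""
    groups = defaultdict(Counter)
    for row, y in zip(rows, labels):
        groups[tuple(row)][y] += 1
    # in each matching group, all but the majority class count as
    # inconsistent
    bad = sum(sum(cnt.values()) - max(cnt.values())
              for cnt in groups.values())
    return bad / len(rows)

rows = [(0, 1), (0, 1), (0, 1), (1, 0)]
labels = ["a", "a", "b", "a"]
print(inconsistency_rate(rows, labels))  # -> 0.25
```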
Modified Chi2 [105]
In the original Chi2 algorithm, the stopping criterion was defined as the point at which the inconsistency rate exceeded a predefined rate δ. The δ value could be given after some tests on the training data for different data sets. The modification proposed was to use the level of consistency, coined from Rough Sets Theory. Thus, this level of consistency replaces the basic inconsistency checking, ensuring that the fidelity of the training data is maintained after discretization and making the process completely automatic.
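Under Rough Sets Theory, the level of consistency can be read as the fraction of instances lying in the positive region, i.e. in groups of instances with identical attribute values that are pure in class. The sketch below illustrates this reading; the function names are assumptions and this is not the exact formulation of the cited paper.

```python
# A sketch of a Rough Sets-style level of consistency: the share of
# instances whose attribute pattern occurs with only one class label
# (the lower-approximation / positive-region reading). Illustrative
# names, not the exact formulation of the cited work.

from collections import Counter, defaultdict

def consistency_level(rows, labels):
    groups = defaultdict(Counter)
    for row, y in zip(rows, labels):
        groups[tuple(row)][y] += 1
    # a group contributes its instances only if it is pure in class
    pure = sum(sum(cnt.values())
               for cnt in groups.values() if len(cnt) == 1)
    return pure / len(labels)

def keeps_fidelity(raw_rows, disc_rows, labels):
    # accept a discretization only if it preserves the consistency
    # level of the original training data
    return consistency_level(disc_rows, labels) >= \
        consistency_level(raw_rows, labels)
```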
 