tion are characteristics present in the best performing methods. However, we can
see that all of them are global, thus identifying a trend towards binning methods.
In general, the accuracy and kappa performance registered by the discretizers do not differ too much. The behavior under both evaluation metrics is quite similar, bearing in mind that the differences in kappa are usually smaller because kappa compensates for random success. Surprisingly, with DataSqueezer, accuracy and kappa show the greatest differences in behavior, but this is explained by the fact that the method focuses on obtaining simple rule sets, leaving precision in the background.
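The correction for random success that kappa applies can be made concrete with a small sketch. The function below computes Cohen's kappa from a confusion matrix; the two-class matrix used in the example is hypothetical and only illustrates how a classifier with 0.85 accuracy can score a noticeably lower kappa once chance agreement is discounted.

```python
def cohen_kappa(cm):
    """Cohen's kappa from a square confusion matrix (rows = actual,
    columns = predicted). Corrects observed accuracy for the
    agreement expected by chance."""
    total = sum(sum(row) for row in cm)
    # observed agreement: the plain accuracy
    p_o = sum(cm[i][i] for i in range(len(cm))) / total
    # expected chance agreement from the row and column marginals
    p_e = sum(
        (sum(cm[i]) / total) * (sum(row[i] for row in cm) / total)
        for i in range(len(cm))
    )
    return (p_o - p_e) / (1 - p_e)

# Hypothetical two-class confusion matrix: 85/100 correct
cm = [[45, 5],
      [10, 40]]
accuracy = (45 + 40) / 100
print(accuracy)            # 0.85
print(cohen_kappa(cm))     # 0.7 — lower, after discounting chance
```

This is why, as noted above, differences between discretizers measured in kappa tend to be smaller than those measured in raw accuracy: part of the raw accuracy is agreement that any random classifier with the same class marginals would achieve.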
It is obvious that there is a direct dependence between the discretizer and the classifier used. We have pointed out that similar behavior can be detected in decision trees and lazy/Bayesian learning, whereas in rule induction learning the operation of the algorithm conditions the effectiveness of the discretizer. Knowing a subset of suitable discretizers for each type of classifier is a good starting point for understanding and proposing improvements in the area.
Another interesting observation can be made about the relationship between accuracy and the number of intervals yielded by a discretizer. A discretizer that computes few cut points does not necessarily obtain poor accuracy, and vice versa.
Finally, we can stress a subset of globally best discretizers considering the trade-off between the number of intervals and the accuracy obtained. In this subset, we can include FUSINTER, Distance, Chi2, MDLP and UCPD.
On the other hand, an analysis centered on the 30 discretizers studied is given as
follows:
Many classic discretizers are usually among the best performing ones. This is the case of ChiMerge, MDLP, Zeta, Distance and Chi2.
Other classic discretizers are not as good as one might expect, considering that they have been refined over the years: EqualWidth, EqualFrequency, 1R, ID3 (the static version is much worse than the dynamic one embedded in the C4.5 operation), CADD, Bayesian and ClusterAnalysis.
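To make the two simplest classics mentioned above concrete, here is a minimal sketch of equal-width and equal-frequency binning. The function names and the sample data are ours; both methods are unsupervised and take only the attribute values and the desired number of bins k, which is why they can be misled by skewed distributions.

```python
def equal_width_cuts(values, k):
    """Cut points dividing [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """Approximate quantile cut points so each of the k bins
    receives roughly the same number of values."""
    s = sorted(values)
    n = len(s)
    return [s[(n * i) // k] for i in range(1, k)]

# Skewed sample: one outlier at 100 dominates the range
data = [1, 2, 2, 3, 10, 11, 12, 100]
print(equal_width_cuts(data, 4))      # [25.75, 50.5, 75.25]
print(equal_frequency_cuts(data, 4))  # [2, 10, 12]
```

Note how the outlier pushes all three equal-width cut points past every other value, leaving most data in a single bin, while the equal-frequency cuts still separate the bulk of the distribution; this sensitivity is one reason these unrefined classics lag behind supervised methods such as MDLP or Chi2.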
Slight modifications of classic methods have greatly enhanced their results; this is the case, for example, of FUSINTER, Modified Chi2, PKID and FFD. In other cases, however, the extensions have diminished performance: USD and Extended Chi2.
Promising techniques that have been evaluated under unfavorable circumstances are MVD and UCPD, which are unsupervised methods useful for application to other DM problems apart from classification.
Recently proposed methods that have been shown to be competitive with classic methods, and even to outperform them in some scenarios, are Khiops, CAIM, MODL, Ameva and CACC. However, recent proposals that have reported poor results in general are Heter-Disc, HellingerBD, DIBD, IDD and HDD.
Finally, this study involves a higher number of data sets than the quantity considered in previous works, and the conclusions reached are impartial towards any specific discretizer. However, we have to stress some coincidences with the conclusions of these previous works. For example, in [105], the authors propose