• Regarding the number of intervals, the discretizers which divide the numerical attributes into fewer intervals are Heter-Disc, MVD and Distance, whereas the discretizers which require a large number of cut points are HDD, ID3 and Bayesian. The Wilcoxon test confirms that Heter-Disc is the discretizer that obtains the fewest intervals, outperforming the rest.
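The discretizers compared above are too involved for a short snippet, but the notion of cut points is easy to illustrate with the simplest unsupervised schemes, equal-width and equal-frequency binning (which also appear later in this comparison as EqualWidth and EqualFrequency). A minimal sketch, with function names and data of our own choosing:

```python
def equal_width_cuts(values, k):
    """Cut points dividing the range of values into k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """Cut points so that each interval holds roughly the same number of values."""
    s = sorted(values)
    n = len(s)
    return [s[(i * n) // k] for i in range(1, k)]

vals = [1.0, 2.0, 2.5, 3.0, 10.0, 11.0, 12.0, 50.0]
print(equal_width_cuts(vals, 4))      # cuts span the full range: [13.25, 25.5, 37.75]
print(equal_frequency_cuts(vals, 4))  # cuts follow the data density: [2.5, 10.0, 12.0]
```

The fewer cut points a method produces, the coarser the resulting representation of the attribute, which is what links the interval counts above to the inconsistency analysis that follows.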
• The inconsistency rate follows a similar trend across all discretizers in both training and test data, with the inconsistency obtained on test data always lower than on training data. ID3 is the discretizer that obtains the lowest average inconsistency rate on training and test data, although the Wilcoxon test cannot find significant differences between it and the other two discretizers: FFD and PKID. We can observe a close relationship between the number of intervals produced and the inconsistency rate: the discretizers that compute fewer cut points are usually those with a high inconsistency rate. They sacrifice the consistency of the data in order to simplify the result, although consistency is not usually correlated with accuracy, as we will see below.
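The inconsistency rate can be made concrete. Under one common definition (an assumption on our part, since the formula is not restated here), it counts, for each distinct discretized attribute pattern, the instances that fall outside that pattern's majority class:

```python
from collections import Counter, defaultdict

def inconsistency_rate(patterns, labels):
    """Fraction of instances whose class differs from the majority class
    of the instances sharing the same (discretized) attribute pattern."""
    by_pattern = defaultdict(Counter)
    for p, y in zip(patterns, labels):
        by_pattern[tuple(p)][y] += 1
    inconsistent = sum(sum(c.values()) - max(c.values())
                       for c in by_pattern.values())
    return inconsistent / len(labels)

# Two identical patterns with different classes -> one inconsistent instance.
X = [(0, 1), (0, 1), (1, 0), (1, 1)]
y = ["a", "b", "a", "a"]
print(inconsistency_rate(X, y))  # 1 / 4 = 0.25
```

Coarser discretizations map more distinct raw values onto the same pattern, which is why methods with few cut points tend to score worse on this measure.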
• In decision trees (C4.5 and PUBLIC), a subset of discretizers can be singled out as the best performing ones. Considering average accuracy, FUSINTER, ChiMerge and CAIM stand out from the rest. Considering average kappa, Zeta and MDLP are also added to this subset. The Wilcoxon test confirms this result and adds another discretizer, Distance, which outperforms 16 of the 29 methods. All the methods emphasized are supervised, incremental (except Zeta) and use statistical and information measures as evaluators. The Splitting/Merging and Local/Global properties have no effect on decision trees.
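Average kappa, used above alongside average accuracy, corrects the agreement between predicted and true classes for agreement expected by chance. A minimal sketch of Cohen's kappa on toy labels of our own invention:

```python
from collections import Counter

def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(y_true)
    observed = sum(t == p for t, p in zip(y_true, y_pred)) / n   # plain accuracy
    true_freq, pred_freq = Counter(y_true), Counter(y_pred)
    # agreement expected by chance from the marginal class frequencies
    expected = sum(true_freq[c] * pred_freq[c] for c in true_freq) / (n * n)
    return (observed - expected) / (1 - expected)

# Accuracy is 0.75, but kappa discounts the chance component -> 0.5.
print(cohen_kappa(["a", "a", "b", "b"], ["a", "a", "b", "a"]))  # 0.5
```

This is why accuracy and kappa can rank discretizers differently, as the results above show: a classifier that leans on the majority class inflates accuracy but not kappa.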
• Considering rule induction (DataSqueezer and Ripper), the best performing discretizers are Distance, Modified Chi2, Chi2, PKID and MODL in average accuracy, and CACC, Ameva, CAIM and FUSINTER in average kappa. In this case, the results are quite irregular: the Wilcoxon test singles out ChiMerge, rather than Distance, as the best performing discretizer for DataSqueezer, and incorporates Zeta into the subset. With Ripper, the Wilcoxon test confirms the results obtained by averaging accuracy and kappa. It is difficult to discern a common set of properties that defines the best performing discretizers, because rule induction methods differ in their operation to a greater extent than decision trees do. However, we can say that, in the subset of best methods, incremental and supervised discretizers with statistical evaluation predominate.
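The pairwise comparisons throughout this analysis rely on the Wilcoxon signed-rank test. As a minimal sketch under standard assumptions (zero differences dropped, tied absolute differences given averaged ranks), the test statistic can be computed for two discretizers' per-dataset accuracies; the values below are invented for illustration, and in practice a library routine such as SciPy's `wilcoxon` would also supply the p-value:

```python
def wilcoxon_signed_rank(a, b):
    """Wilcoxon signed-rank statistic: min(W+, W-) over paired samples."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # drop zero differences
    ranked = sorted(diffs, key=abs)
    ranks = [0.0] * len(ranked)
    i = 0
    while i < len(ranked):
        j = i
        while j < len(ranked) and abs(ranked[j]) == abs(ranked[i]):
            j += 1
        avg = (i + 1 + j) / 2          # average of ranks i+1 .. j for tied |d|
        for k in range(i, j):
            ranks[k] = avg
        i = j
    w_plus = sum(r for d, r in zip(ranked, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(ranked, ranks) if d < 0)
    return min(w_plus, w_minus)

# Accuracy (%) of two hypothetical discretizers over six datasets.
acc_a = [85, 78, 91, 80, 88, 75]
acc_b = [82, 79, 87, 80, 84, 70]
print(wilcoxon_signed_rank(acc_a, acc_b))  # 1.0
```

A small statistic relative to the critical value for the sample size indicates that one method consistently beats the other across datasets, which is how the "outperforms N of 29 methods" counts above are obtained.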
• Lazy and Bayesian learning can be analyzed together, since the HVDM distance used in KNN is closely related to the computation of Bayesian probabilities assuming attribute independence [114]. With respect to lazy and Bayesian learning, KNN and Naïve Bayes, the subset of remarkable discretizers is formed by PKID, FFD, Modified Chi2, FUSINTER, ChiMerge, CAIM, EqualWidth and Zeta when average accuracy is used; Chi2, Khiops, EqualFrequency and MODL must be added when average kappa is considered. The statistical report by the Wilcoxon test reveals two outstanding methods: PKID for KNN, which outperforms 27 of the 29 methods, and FUSINTER for Naïve Bayes. Here, supervised and unsupervised, direct and incremental, binning and statistical/information evalua-