tion are characteristics present in the best performing methods. However, we can
see that all of them are global, thus identifying a trend towards binning methods.
In general, the accuracy and kappa performance registered by the discretizers do not differ too much. The behavior under both evaluation metrics is quite similar, bearing in mind that the differences in kappa are usually smaller because kappa compensates for random success. Surprisingly, with DataSqueezer, accuracy and kappa show the greatest differences in behavior, but this is explained by the fact that the method focuses on obtaining simple rule sets, leaving precision in the background.
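The correction for random success that kappa applies can be made concrete with a small sketch. The function below computes Cohen's kappa from a confusion matrix; the two-class matrix used in the example is hypothetical and only illustrates how a classifier with 0.85 accuracy can score a noticeably lower kappa once chance agreement is discounted.

```python
def cohen_kappa(cm):
    """Cohen's kappa from a square confusion matrix (rows = actual,
    columns = predicted). Corrects observed accuracy for the
    agreement expected by chance."""
    total = sum(sum(row) for row in cm)
    # observed agreement: the plain accuracy
    p_o = sum(cm[i][i] for i in range(len(cm))) / total
    # expected chance agreement from the row and column marginals
    p_e = sum(
        (sum(cm[i]) / total) * (sum(row[i] for row in cm) / total)
        for i in range(len(cm))
    )
    return (p_o - p_e) / (1 - p_e)

# Hypothetical two-class confusion matrix: 85/100 correct
cm = [[45, 5],
      [10, 40]]
accuracy = (45 + 40) / 100
print(accuracy)            # 0.85
print(cohen_kappa(cm))     # 0.7 — lower, after discounting chance
```

This is why, as noted above, differences between discretizers measured in kappa tend to be smaller than those measured in raw accuracy: part of the raw accuracy is agreement that any random classifier with the same class marginals would achieve.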
It is obvious that there is a direct dependence between the discretizer and the classifier used. We have pointed out that similar behavior can be detected in decision trees and lazy/Bayesian learning, whereas in rule induction learning the operation of the algorithm conditions the effectiveness of the discretizer. Knowing a subset of suitable discretizers for each type of classifier is a good starting point for understanding and proposing improvements in the area.
Another interesting observation can be made about the relationship between accuracy and the number of intervals yielded by a discretizer. A discretizer that computes few cut points does not necessarily obtain poor accuracy, and vice versa.
Finally, we can stress a subset of globally best discretizers considering the trade-off between the number of intervals and the accuracy obtained. In this subset, we can include FUSINTER, Distance, Chi2, MDLP and UCPD.
On the other hand, an analysis centered on the 30 discretizers studied is given as
follows:
Many classic discretizers are usually among the best performing ones. This is the case of ChiMerge, MDLP, Zeta, Distance and Chi2.
Other classic discretizers are not as good as one might expect, considering that they have been refined over the years: EqualWidth, EqualFrequency, 1R, ID3 (the static version is much worse than the dynamic one embedded in the C4.5 operation), CADD, Bayesian and ClusterAnalysis.
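To make the two simplest classics mentioned above concrete, here is a minimal sketch of equal-width and equal-frequency binning. The function names and the sample data are ours; both methods are unsupervised and take only the attribute values and the desired number of bins k, which is why they can be misled by skewed distributions.

```python
def equal_width_cuts(values, k):
    """Cut points dividing [min, max] into k intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + step * i for i in range(1, k)]

def equal_frequency_cuts(values, k):
    """Approximate quantile cut points so each of the k bins
    receives roughly the same number of values."""
    s = sorted(values)
    n = len(s)
    return [s[(n * i) // k] for i in range(1, k)]

# Skewed sample: one outlier at 100 dominates the range
data = [1, 2, 2, 3, 10, 11, 12, 100]
print(equal_width_cuts(data, 4))      # [25.75, 50.5, 75.25]
print(equal_frequency_cuts(data, 4))  # [2, 10, 12]
```

Note how the outlier pushes all three equal-width cut points past every other value, leaving most data in a single bin, while the equal-frequency cuts still separate the bulk of the distribution; this sensitivity is one reason these unrefined classics lag behind supervised methods such as MDLP or Chi2.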
Slight modifications of classic methods have greatly enhanced their results; this is the case, for example, of FUSINTER, Modified Chi2, PKID and FFD. In other cases, however, the extensions have diminished performance: USD and Extended Chi2.
Promising techniques that have been evaluated under unfavorable circumstances are MVD and UCPD, which are unsupervised methods useful for application to other DM problems apart from classification.
Recently proposed methods that have been shown to be competitive with classic methods, and even to outperform them in some scenarios, are Khiops, CAIM, MODL, Ameva and CACC. However, recent proposals that have reported poor results in general are Heter-Disc, HellingerBD, DIBD, IDD and HDD.
Finally, this study involves a higher number of data sets than the quantity considered in previous works, and the conclusions reached are impartial towards any specific discretizer. However, we have to stress some coincidences with the conclusions of these previous works. For example, in [105], the authors propose