Discretization - Data Preprocessing in Data Mining

or consistency. For example, a threshold for m can be an upper bound for the arity of the resulting discretization. A stopping criterion can be very simple, such as fixing the number of final intervals at the beginning of the process, or more complex, such as estimating a function.
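
The simplest stopping criterion mentioned above, a fixed number of final intervals, can be sketched as follows. This is only an illustrative bottom-up merger, not a method taken from the text: the function name merge_until, the parameter max_intervals, and the merge rule (merge the adjacent pair that least increases the size-weighted class entropy) are assumptions made for the example.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def merge_cost(a, b):
    """Increase in size-weighted entropy caused by merging label lists a and b."""
    merged = a + b
    return len(merged) * entropy(merged) - len(a) * entropy(a) - len(b) * entropy(b)


def merge_until(values, labels, max_intervals):
    """Merge adjacent intervals until at most max_intervals remain (hypothetical helper)."""
    # Start with one interval per distinct value, in sorted order.
    intervals = []  # each item: (values_in_interval, labels_in_interval)
    for v, y in sorted(zip(values, labels)):
        if intervals and intervals[-1][0][-1] == v:
            intervals[-1][0].append(v)
            intervals[-1][1].append(y)
        else:
            intervals.append(([v], [y]))

    # Stopping criterion: the number of intervals fixed in advance.
    while len(intervals) > max_intervals:
        i = min(
            range(len(intervals) - 1),
            key=lambda k: merge_cost(intervals[k][1], intervals[k + 1][1]),
        )
        left, right = intervals[i], intervals[i + 1]
        intervals[i:i + 2] = [(left[0] + right[0], left[1] + right[1])]

    # Place each cut point midway between neighbouring intervals.
    return [
        (intervals[k][0][-1] + intervals[k + 1][0][0]) / 2
        for k in range(len(intervals) - 1)
    ]


cuts = merge_until([1, 2, 3, 4, 5, 6], list("aaabbb"), max_intervals=2)
print(cuts)
```

A more complex criterion would replace the condition of the while loop with a test on an estimated function, for example stopping as soon as the cheapest merge exceeds a threshold.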

9.2.2 Related and Advanced Work

Research on improving and analyzing discretization is active and in high demand. Depending on the DM task, discretization can be a promising technique for obtaining the desired results, which explains its relationship to other methods and problems. This section briefly summarizes topics closely related to discretization from both a theoretical and a practical point of view, and describes other works and future trends that have been studied in the last few years.

• Discretization Specific Analysis: Susmaga proposed an analysis method for discretizers based on the binarization of continuous attributes and rough set measures [104]. He emphasized that the method is useful for detecting redundancy in a discretization and for identifying the cut points that can be removed without decreasing performance. It can also be applied to improve existing discretization approaches.

• Optimal Multisplitting: Elomaa and Rousu characterized fundamental properties of several classic evaluation functions used in supervised univariate discretization. They analyzed entropy, information gain, gain ratio, training set error, Gini index and normalized distance measure, concluding that they are suitable for use in the optimal multisplitting of an attribute [38]. They also developed an optimal algorithm for performing this multisplitting process and devised two techniques [39, 40] to speed it up.

• Discretization of Continuous Labels: Two approaches have been used to convert a continuous supervised learning problem (regression) into a nominal one (classification). The first is simply to use regression tree algorithms, such as CART [17]. The second consists of applying discretization to the output attribute, either statically [46] or dynamically [61].

• Fuzzy Discretization: Extensive research has been carried out on the definition of linguistic terms that divide the attribute domain into fuzzy regions [62]. Fuzzy discretization is characterized by a membership value, a group or interval number, and an affinity corresponding to each attribute value, unlike crisp discretization, which only considers the interval number [95].

• Cost-Sensitive Discretization: The objective of cost-based discretization is to take into account the cost of making errors instead of just minimizing the total number of errors [63]. It is related to the problems of imbalanced and cost-sensitive classification [57, 103].
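
To make some of the evaluation measures named in the list above concrete, the following minimal sketch scores one candidate set of cut points with weighted entropy, weighted Gini index, training set error, and a cost-weighted error. It is not the optimal multisplitting algorithm of Elomaa and Rousu, nor any specific cost-sensitive discretizer from the cited works. All function names, the toy data, and the cost table (keyed by pairs of true and predicted class) are assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter
from math import log2


def split_by_cuts(values, labels, cut_points):
    """Group the labels by the interval that each value falls into."""
    cuts = sorted(cut_points)
    groups = [[] for _ in range(len(cuts) + 1)]
    for v, y in zip(values, labels):
        groups[bisect_right(cuts, v)].append(y)
    return [g for g in groups if g]  # drop empty intervals


def entropy(group):
    n = len(group)
    return -sum((c / n) * log2(c / n) for c in Counter(group).values())


def gini(group):
    n = len(group)
    return 1.0 - sum((c / n) ** 2 for c in Counter(group).values())


def weighted_impurity(groups, impurity):
    """Average impurity of the intervals, weighted by interval size."""
    n = sum(len(g) for g in groups)
    return sum(len(g) / n * impurity(g) for g in groups)


def training_errors(groups):
    """Errors made when each interval predicts its majority class."""
    return sum(len(g) - max(Counter(g).values()) for g in groups)


def total_cost(groups, cost):
    """Cost incurred when each interval predicts its cheapest class.

    cost maps (true_class, predicted_class) to a misclassification cost
    (an assumed representation; missing pairs cost 0).
    """
    classes = {y for g in groups for y in g}
    total = 0.0
    for g in groups:
        counts = Counter(g)
        total += min(
            sum(n * cost.get((true, pred), 0.0)
                for true, n in counts.items() if true != pred)
            for pred in classes
        )
    return total


values = [1, 2, 3, 4, 5, 6]
labels = ["a", "a", "a", "b", "b", "a"]
groups = split_by_cuts(values, labels, [3.5])
cost = {("a", "b"): 5.0, ("b", "a"): 1.0}  # missing an "a" is five times worse

print(weighted_impurity(groups, entropy))
print(weighted_impurity(groups, gini))
print(training_errors(groups))
print(total_cost(groups, cost))
```

In this toy example the second interval would be labeled "b" when only errors are counted, but "a" when the assumed costs are taken into account, which is the distinction that cost-sensitive discretization draws.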
