to partial data. Dynamic discretizers, by contrast, find the cut points during the internal operation of a DM algorithm, so they never gain access to the full data set.
Direct versus Incremental: Direct discretizers divide the range into k intervals simultaneously, requiring an additional criterion to determine the value of k. They include not only one-step discretization methods, but also discretizers that perform several stages in their operation, selecting more than a single cut point at every step. By contrast, incremental methods begin with a simple discretization and pass through an improvement process, requiring an additional criterion to know when to stop. At each step, they find the best candidate boundary to be used as a cut point, and the remaining decisions are made accordingly. Incremental discretizers are also known as hierarchical discretizers [9]. Both types of discretizers are widespread in the literature, although incremental discretizers are usually also supervised.
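The contrast between the two families can be sketched in code. The example below is illustrative only: the function names are hypothetical, and the incremental method uses a simple largest-gap splitting criterion with a minimum-interval-size stopping rule, rather than any specific published measure.

```python
# Illustrative sketch: a direct discretizer fixes k up front and emits all
# cut points in one step; an incremental one repeatedly picks the best
# candidate boundary until a stopping criterion fires.

def equal_width_cuts(values, k):
    """Direct: choose all k-1 cut points in a single step, given k."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def incremental_cuts(values, min_size=2):
    """Incremental: split at the best candidate boundary (here, the
    midpoint of the widest gap) until intervals become too small."""
    values = sorted(values)
    cuts = []

    def split(segment):
        if len(segment) < 2 * min_size:
            return  # stopping criterion: both halves must keep min_size points
        gaps = [(segment[i + 1] - segment[i], i)
                for i in range(min_size - 1, len(segment) - min_size)]
        gap, i = max(gaps)
        if gap == 0:
            return
        cuts.append((segment[i] + segment[i + 1]) / 2)
        split(segment[:i + 1])
        split(segment[i + 1:])

    split(values)
    return sorted(cuts)

data = [1.0, 1.2, 1.4, 5.0, 5.2, 9.8, 10.0, 10.3]
print(equal_width_cuts(data, 3))  # two cut points chosen at once
print(incremental_cuts(data))     # cut points found by successive splits
```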
Evaluation Measure: This is the metric used by the discretizer to compare two candidate schemes and decide which is more suitable. We consider five main families of evaluation measures:
- Information: This family includes entropy as the most used evaluation measure in discretization (MDLP [41], ID3 [92], FUSINTER [126]) and other derived information-theory measures such as the Gini index [66].
- Statistical: Statistical evaluation involves the measurement of dependency/correlation among attributes (Zeta [58], ChiMerge [68], Chi2 [76]), probability and Bayesian properties [119] (MODL [16]), interdependency [70], the contingency coefficient [106], etc.
- Rough Sets: This group is composed of methods that evaluate the discretization schemes by using rough set measures and properties [86], such as lower and upper approximations, class separability, etc.
- Wrapper: This collection comprises methods that rely on the error provided by a classifier that is run for each evaluation. The classifier can be a very simple one, such as a majority class voting classifier (Valley [108]), or a general classifier such as Naïve Bayes (NBIterative [87]).
- Binning: This category refers to the absence of an evaluation measure. It is the simplest way to discretize an attribute: a specified number of bins is created, each bin is defined a priori, and each allocates a specified number of values per attribute. Widely used binning methods are EqualWidth and EqualFrequency.
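The two binning methods named above can be sketched as follows. This is a minimal illustration, not a specific library's API; the function names are hypothetical.

```python
# EqualWidth: bins of equal range. EqualFrequency: bins holding roughly
# the same number of values. Neither uses an evaluation measure.

def equal_width(values, k):
    """Assign each value a bin index from k equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # guard against a constant attribute
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency(values, k):
    """Assign bins so each holds roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = min(rank * k // len(values), k - 1)
    return bins

data = [2, 4, 5, 7, 30, 31, 33, 99]
print(equal_width(data, 4))      # skewed data: the first four values share bin 0
print(equal_frequency(data, 4))  # balanced: two values per bin
```

The contrast shows why the choice matters: on skewed data, EqualWidth leaves some bins nearly empty, while EqualFrequency balances the counts at the cost of uneven interval widths.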
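An information-based evaluation measure from the first family can likewise be illustrated with class entropy, in the spirit of MDLP-style discretizers. This is a hedged sketch of the weighted-entropy scoring step only, not the full MDLP criterion, and the function names are illustrative.

```python
# Score a candidate cut point by the weighted class entropy of the two
# resulting intervals; lower scores indicate purer intervals.
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a class-label multiset."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def cut_score(values, labels, cut):
    """Weighted class entropy after splitting at `cut` (lower is better)."""
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)

xs = [1, 2, 3, 10, 11, 12]
ys = ['a', 'a', 'a', 'b', 'b', 'b']
print(cut_score(xs, ys, 5))   # perfectly pure split -> 0.0
print(cut_score(xs, ys, 10))  # mixed left interval -> higher score
```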
9.3.1.2 Other Properties
We can discuss other properties related to discretization that also influence the operation and results of a discretizer, although to a lesser degree than the characteristics explained above. Furthermore, some of them present a large variety of categorizations and may harm the interpretability of the taxonomy.
Parametric versus Non-Parametric: This property refers to the automatic determi-
nation of the number of intervals for each attribute by the discretizer. A nonpara-