Parametric versus Non-Parametric: A non-parametric discretizer computes the appropriate number of intervals for each attribute, considering a trade-off between the loss of information or consistency and obtaining the lowest number of intervals. A parametric discretizer requires the user to fix the maximum number of intervals desired. Examples of non-parametric discretizers are MDLP [41] and CAIM [70]. Examples of parametric ones are ChiMerge [68] and CADD [24].
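To make this distinction concrete, the sketch below shows a parametric discretizer in which the user must fix the number of intervals in advance; plain equal-width binning is used purely as an illustration and is not one of the methods cited above.

```python
# Parametric: the user fixes the number of intervals k up front.
# A non-parametric method would instead derive k from the data itself,
# trading information loss against the number of intervals.
def equal_width_cutpoints(values, k):
    """Return the k-1 cut points splitting [min, max] into k equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

ages = [23, 25, 31, 35, 46, 52, 60]
print(equal_width_cutpoints(ages, 4))  # [32.25, 41.5, 50.75]
```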
Top-Down versus Bottom-Up: This property is only observed in incremental discretizers. Top-Down methods begin with an empty discretization, and their improvement process simply adds a new cut point to the discretization. By contrast, Bottom-Up methods begin with a discretization that contains all the possible cut points, and their improvement process iteratively merges two intervals, removing a cut point. A classic Top-Down method is MDLP [41] and a well-known Bottom-Up method is ChiMerge [68].
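The two improvement steps can be sketched structurally as follows; the score and cost functions are placeholders standing in for whatever criterion a concrete method such as MDLP or ChiMerge applies, so this is only a skeleton.

```python
# Top-Down: start from an empty discretization and add the best
# remaining candidate cut point at each step.
def top_down_step(cutpoints, candidates, score):
    best = max(candidates, key=score)
    return sorted(cutpoints + [best]), [c for c in candidates if c != best]

# Bottom-Up: start from all possible cut points and remove the one
# whose removal (merging its two adjacent intervals) is cheapest.
def bottom_up_step(cutpoints, merge_cost):
    worst = min(cutpoints, key=merge_cost)
    return [c for c in cutpoints if c != worst]

cuts, cands = top_down_step([], [10, 20, 30], score=lambda c: -abs(c - 20))
print(cuts)  # [20] -- the discretization grows one cut point at a time
print(bottom_up_step([10, 20, 30], merge_cost=lambda c: abs(c - 20)))
# [10, 30] -- the discretization shrinks one cut point at a time
```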
Stopping Condition: This is related to the mechanism used to stop the discretization process and must be specified in non-parametric approaches. Well-known stopping criteria are the Minimum Description Length measure [41], confidence thresholds [68], or inconsistency ratios [26].
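As an illustration of a confidence-threshold criterion, the following sketch tests whether two adjacent intervals are statistically distinguishable, in the spirit of ChiMerge's stopping rule; the class counts are invented for the example.

```python
from scipy.stats import chi2

def chi_square(counts_a, counts_b):
    """Chi-square statistic for the class counts of two adjacent intervals."""
    total_a, total_b = sum(counts_a), sum(counts_b)
    stat = 0.0
    for ca, cb in zip(counts_a, counts_b):
        col = ca + cb
        # expected counts under independence (guarded against zero)
        ea = max(total_a * col / (total_a + total_b), 1e-9)
        eb = max(total_b * col / (total_a + total_b), 1e-9)
        stat += (ca - ea) ** 2 / ea + (cb - eb) ** 2 / eb
    return stat

num_classes = 2
threshold = chi2.ppf(0.95, df=num_classes - 1)  # critical value at 95%
# Statistic ~5.05 > ~3.84: the intervals differ, so they stay apart;
# merging stops once every adjacent pair exceeds the threshold.
print(chi_square([8, 2], [3, 7]) > threshold)   # True
```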
Disjoint versus Non-Disjoint: Disjoint methods discretize the value range of the attribute into separated, non-overlapping intervals, whereas non-disjoint methods discretize the value range into intervals that can overlap. The methods reviewed in this chapter are disjoint, while fuzzy discretization is usually non-disjoint [62].
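A minimal sketch of the non-disjoint case, assuming simple triangular fuzzy intervals (the interval boundaries below are illustrative, not taken from any cited method):

```python
def triangular(x, left, centre, right):
    """Membership degree of x in a triangular fuzzy interval."""
    if x <= left or x >= right:
        return 0.0
    if x <= centre:
        return (x - left) / (centre - left)
    return (right - x) / (right - centre)

# The intervals "low" (0-50) and "medium" (25-75) overlap, so x = 40
# belongs partially to both -- impossible under a disjoint scheme.
x = 40
print(triangular(x, 0, 25, 50), triangular(x, 25, 50, 75))  # 0.4 0.6
```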
Ordinal versus Nominal: Ordinal discretization transforms quantitative data into ordinal qualitative data, whereas nominal discretization transforms it into nominal qualitative data, discarding the information about order. Ordinal discretizers are less common and are not usually considered classic discretizers [80].
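The practical difference shows up in how the resulting values are encoded; a small sketch with invented cut points:

```python
cutpoints = [30, 50]

def ordinal_code(x):
    """Interval index of x; ordinal codes preserve the original order."""
    return sum(x > c for c in cutpoints)  # 0, 1 or 2, and 0 < 1 < 2

# A nominal encoding keeps only the category identity: a learner sees
# three unordered symbols and cannot exploit any order between them.
nominal_labels = {0: "young", 1: "adult", 2: "senior"}

print(ordinal_code(42))                  # 1
print(nominal_labels[ordinal_code(42)])  # adult
```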
9.3.1.3 Criteria to Compare Discretization Methods
When comparing discretization methods, there are a number of criteria that can be used to evaluate the relative strengths and weaknesses of each algorithm. These include the number of intervals, inconsistency, predictive classification rate, and time requirements.
Number of Intervals: A desirable feature for practical discretization is that discretized attributes have as few values as possible, since a large number of intervals may make the learning slow and ineffective [19].
Inconsistency: A supervision-based measure used to compute the number of unavoidable errors produced in the data set. An unavoidable error is one associated with two examples that have the same values for the input attributes but different class labels. In general, data sets with continuous attributes are consistent, but when a discretization scheme is applied over the data, an inconsistent data set may be obtained. The desired inconsistency level for a discretizer is 0.0.
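The measure can be computed directly from the discretized data; a minimal sketch, where, within each group of examples sharing identical attribute values, everything outside the group's majority class counts as an unavoidable error:

```python
from collections import Counter, defaultdict

def inconsistency_rate(X, y):
    """Fraction of examples whose label clashes with the majority label
    of other examples sharing exactly the same attribute values."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[tuple(row)].append(label)
    errors = 0
    for labels in groups.values():
        # everything outside the group's majority class is unavoidable
        errors += len(labels) - Counter(labels).most_common(1)[0][1]
    return errors / len(y)

X = [(0, 1), (0, 1), (0, 1), (1, 0)]  # discretized attribute values
y = ["a", "a", "b", "a"]              # one clash in the first group
print(inconsistency_rate(X, y))       # 0.25
```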
 