the learning algorithm [ 75 ]. Almost all known discretizers are static, because most dynamic discretizers are really subparts or stages of DM algorithms that operate on numerical data [ 13 ]. Some examples of well-known dynamic techniques are the ID3 discretizer [ 92 ] and ITFP [ 6 ].
Univariate versus Multivariate: Multivariate techniques, also known as 2D dis-
cretization [ 81 ], simultaneously consider all attributes to define the initial set of
cut points or to decide the best cut point altogether. They can also discretize one
attribute at a time when studying the interactions with other attributes, exploiting
high order relationships. By contrast, univariate discretizers only work with a sin-
gle attribute at a time, once an order among attributes has been established, and
the resulting discretization scheme in each attribute remains unchanged in later
stages. Interest has recently arisen in developing multivariate discretizers, since they are very influential in deductive learning [ 10 , 49 ] and in complex classification problems with high interactions among multiple attributes, which univariate discretizers might overlook [ 42 , 121 ].
Supervised versus Unsupervised: Unsupervised discretizers do not consider the
class label whereas supervised ones do. The manner in which the latter consider
the class attribute depends on the interaction between input attributes and class
labels, and the heuristic measures used to determine the best cut points (entropy,
interdependence, etc.). Most discretizers proposed in the literature are supervised and, by exploiting class information, should in theory determine the best number of intervals for each attribute automatically. That a discretizer is unsupervised does not mean it cannot be applied to supervised tasks; a supervised discretizer, however, can only be applied to supervised DM problems. Representative
unsupervised discretizers are EqualWidth and EqualFrequency [ 73 ], PKID and
FFD [ 122 ] and MVD [ 10 ].
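To illustrate the unsupervised family, the following is a minimal sketch of the cut-point selection behind equal-width and equal-frequency binning (the ideas underlying EqualWidth and EqualFrequency); function names and data are illustrative, not taken from any particular implementation:

```python
def equal_width(values, k):
    """Unsupervised equal-width binning: k intervals of equal size
    spanning the attribute's range (a sketch of the EqualWidth idea)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency(values, k):
    """Unsupervised equal-frequency binning: cut points chosen so each
    interval holds roughly the same number of observed values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(i * n) // k] for i in range(1, k)]

vals = [1, 2, 2, 3, 8, 9, 10, 20]
print(equal_width(vals, 4))       # -> [5.75, 10.5, 15.25]
print(equal_frequency(vals, 4))   # -> [2, 8, 10]
```

Note that neither function ever inspects a class label, which is precisely what makes these methods unsupervised and applicable to both supervised and unsupervised tasks.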
Splitting versus Merging: This refers to the procedure used to create or define new
intervals. Splitting methods establish a cut point among all the possible boundary
points and divide the domain into two intervals. By contrast, merging methods start with a pre-defined partition and remove a candidate cut point to mix both adjacent
intervals. These properties are closely related to the Top-Down and Bottom-Up approaches, respectively (explained in the next section). The underlying idea is very similar, except that top-down and bottom-up discretizers assume the process is incremental (described later), following a hierarchical construction of the discretization. In fact,
there can be discretizers whose operation is based on splitting or merging more
than one interval at a time [ 72 , 96 ]. Also, some discretizers can be considered
hybrid due to the fact that they can alternate splits with merges in running time
[ 24 , 43 ].
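To illustrate the splitting side, here is a minimal sketch of a single splitting step under an assumed weighted class-entropy criterion; the names and the criterion are illustrative, and real splitting discretizers such as MDLP additionally apply a stopping rule before recursing:

```python
from math import log2

def entropy(labels):
    """Shannon entropy of the class distribution in labels."""
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

def best_split(values, labels):
    """One splitting step: scan every boundary between sorted distinct
    values and return the cut point whose two halves minimise the
    weighted class entropy (the basic move of top-down, entropy-based
    discretizers)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_score, best_cut = float('inf'), None
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # equal values: no boundary point here
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if score < best_score:
            best_score = score
            best_cut = (pairs[i - 1][0] + pairs[i][0]) / 2
    return best_cut

# classes separate cleanly, so the best cut falls between 3 and 8
print(best_split([1, 2, 3, 8, 9, 10], ['a', 'a', 'a', 'b', 'b', 'b']))  # -> 5.5
```

A merging method would run the mirror procedure: start from a fine partition and repeatedly remove the cut point whose elimination least harms (or most improves) the chosen measure.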
Global versus Local: To make a decision, a discretizer can either require all
available data in the attribute or use only partial information. A discretizer is said
to be local when it only makes the partition decision based on local information.
Examples of widely used local techniques are MDLP [ 41 ] and ID3 [ 92 ]. Few discretizers are local; the exceptions are mainly those based on top-down partitioning, together with all the dynamic
techniques. In a top-down process, some algorithms follow the divide-and-conquer
scheme and when a split is found, the data is recursively divided, restricting access