Chapter 9
Discretization
Abstract Discretization is an essential preprocessing technique used in many
knowledge discovery and data mining tasks. Its main goal is to transform a set of
continuous attributes into discrete ones by associating categorical values with
intervals, thus transforming quantitative data into qualitative data. An overview of
discretization, together with a complete outlook and taxonomy, is supplied in
Sects. 9.1 and 9.2. We conduct an experimental study in supervised classification
involving the most representative discretizers, different types of classifiers, and a
large number of data sets (Sect. 9.4).
9.1 Introduction
As mentioned in the introduction, data usually comes in different formats, such as
discrete, numerical, continuous, or categorical. Numerical data, whether given as
discrete or continuous values, is assumed to be ordinal: there is an order among the
values. In categorical data, however, no order can be assumed among the values. The
domain and type of the data are crucial to the learning task to be performed next.
For example, in a decision tree induction process, a feature must be chosen from a
subset based on some gain metric associated with its values. This process usually
requires inherently finite values and prefers to branch on values that are not
ordered. The tree is a finite structure, so a feature must be split to produce the
associated nodes in further divisions. If the data is continuous, the features need
to be discretized either before the decision tree induction or during the tree
modelling process.
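As a rough illustration of splitting a continuous feature during tree induction, the sketch below searches for a single cut point that maximizes a gain metric. It assumes entropy-based information gain as that metric, which the text above does not specify, and the names `entropy`, `best_cut_point`, and `discretize` are hypothetical helpers introduced only for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def best_cut_point(values, labels):
    """Return (cut, gain) for the binary split `value <= cut` with the
    highest information gain; cut is None if no split improves on 0."""
    pairs = sorted(zip(values, labels))
    base = entropy([y for _, y in pairs])
    best_cut, best_gain = None, 0.0
    for i in range(1, len(pairs)):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # identical values cannot be separated by a cut
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(pairs)
        gain = base - weighted
        if gain > best_gain:
            # Assumption: place the cut midway between neighbouring values.
            best_cut, best_gain = (lo + hi) / 2, gain
    return best_cut, best_gain


def discretize(value, cut):
    """Map a continuous value to one of two nominal intervals."""
    return "low" if value <= cut else "high"


# Toy data (made up for illustration): two well-separated classes.
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels = ["a", "a", "a", "b", "b", "b"]
cut, gain = best_cut_point(values, labels)
print(cut, gain, [discretize(v, cut) for v in values])
```

Applying the same search recursively inside each resulting interval would yield more than two intervals; the sketch stops at one cut to stay short.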
Discretization, as one of the basic data reduction techniques, has received
increasing research attention in recent years [75] and has become one of the most
broadly used preprocessing techniques in data mining (DM). The discretization process
transforms quantitative data into qualitative data, that is, numerical attributes
into discrete or nominal attributes with a finite number of intervals, obtaining a
non-overlapping partition of a continuous domain. An association between each
interval and a discrete numerical value is then established. In practice,
discretization can be viewed as a data reduction method since it maps data from a
huge spectrum of numeric values to a greatly
 