reduced subset of discrete values. Once the discretization is performed, the data can
be treated as nominal data during any induction or deduction DM process. Many
existing DM algorithms are designed to learn only from categorical data, using nominal attributes, whereas real-world applications usually involve continuous features. Numerical features therefore have to be discretized before such algorithms can be used.
In supervised learning, and specifically classification, the topic of this survey, we can define discretization as follows. Assuming a data set consisting of $N$ examples and $C$ target classes, a discretization algorithm would discretize the continuous attribute $A$ in this data set into $m$ discrete intervals $D = \{[d_0, d_1], (d_1, d_2], \ldots, (d_{m-1}, d_m]\}$, where $d_0$ is the minimal value, $d_m$ is the maximal value and $d_i < d_{i+1}$, for $i = 0, 1, \ldots, m-1$. Such a discrete result $D$ is called a discretization scheme on attribute $A$, and $P = \{d_1, d_2, \ldots, d_{m-1}\}$ is the set of cut points of attribute $A$.
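To make this notation concrete, the following sketch (written in Python; the function name and the sample values are illustrative, not taken from any particular library) maps the values of a continuous attribute $A$ onto the intervals induced by a set of cut points $P$:

```python
# Illustrative sketch: applying a discretization scheme defined by the cut
# points P = {d_1, ..., d_{m-1}} to a continuous attribute A. Together with
# the minimum d_0 and maximum d_m, the cut points induce the intervals
# [d_0, d_1], (d_1, d_2], ..., (d_{m-1}, d_m].

def discretize(values, cut_points):
    """Return, for each continuous value, the index of the interval it falls in."""
    labels = []
    for v in values:
        # First cut point not exceeded by v; values above the last cut point
        # fall into the final interval (d_{m-1}, d_m].
        interval = next((i for i, d in enumerate(cut_points) if v <= d),
                        len(cut_points))
        labels.append(interval)
    return labels

# Example: cut points P = {5.0, 10.0} induce m = 3 intervals.
A = [1.2, 5.0, 7.3, 10.0, 12.8]
print(discretize(A, [5.0, 10.0]))  # -> [0, 0, 1, 1, 2]
```

Once the values are replaced by their interval indices (or by symbolic labels for those indices), the attribute can be treated as any other nominal attribute by the learning algorithm.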
The necessity of using discretization on data can be caused by several factors.
Many DM algorithms are primarily oriented to handle nominal attributes [ 36 , 75 ,
123 ], or may even only deal with discrete attributes. For instance, three of the ten
methods considered as the top ten in DM [ 120 ] require an embedded or an external
discretization of data: C4.5 [ 92 ], Apriori [ 1 ] and Naïve Bayes [ 44 , 122 ]. Even with
algorithms that are able to deal with continuous data, learning is less efficient and
effective [29, 94]. Other advantages derived from discretization are the reduction and simplification of the data, which make learning faster and yield more accurate, compact and shorter results; in addition, any noise possibly present in the data is reduced.
For both researchers and practitioners, discrete attributes are easier to understand,
use, and explain [ 75 ]. Nevertheless, any discretization process generally leads to a
loss of information, making the minimization of such information loss the main goal
of a discretizer.
Obtaining the optimal discretization is NP-complete [ 25 ]. A vast number of dis-
cretization techniques can be found in the literature. It is obvious that when dealing
with a concrete problem or data set, the choice of a discretizer will condition the success of the subsequent learning task in terms of accuracy, simplicity of the model, etc. Different
heuristic approaches have been proposed for discretization, for example, approaches
based on information entropy [36, 41], the statistical $\chi^2$ test [68, 76], likelihood [16, 119], rough sets [86, 124], etc. Other criteria have been used in order to provide a classification of discretizers, such as univariate/multivariate, supervised/unsupervised,
top-down/bottom-up, global/local, static/dynamic and more. All these criteria are
the basis of the taxonomies already proposed and they will be deeply elaborated
upon in this chapter. The identification of the best discretizer for each situation is a
very difficult task to carry out, but performing exhaustive experiments considering a
representative set of learners and discretizers could help to make the best choice.
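As an illustration of one of these heuristic families, the sketch below shows the class-information-entropy criterion on which entropy-based discretizers such as those in [36, 41] rely. It only evaluates candidate binary cut points; the recursive partitioning and the stopping rule (e.g., MDLP) that a complete method would add are omitted, and the function names and toy data are assumptions made for this example:

```python
# Sketch of the class-information-entropy criterion behind entropy-based
# discretizers: choose the cut point that minimizes the weighted class
# entropy of the two resulting intervals (i.e., maximizes information gain).
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def split_entropy(values, labels, cut):
    """Weighted class entropy after splitting the attribute at `cut`."""
    left = [y for x, y in zip(values, labels) if x <= cut]
    right = [y for x, y in zip(values, labels) if x > cut]
    n = len(labels)
    return (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)

# Toy attribute whose low values belong to class 'a' and high values to 'b'.
values = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
labels = ['a', 'a', 'a', 'b', 'b', 'b']
best = min(set(values[:-1]), key=lambda c: split_entropy(values, labels, c))
print(best)  # -> 3.0, which separates the two classes perfectly
```

A supervised discretizer built on this criterion would apply it recursively inside each resulting interval until a stopping condition is met, whereas unsupervised methods such as equal-width or equal-frequency binning ignore the class labels entirely.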
Some reviews of discretization techniques can be found in the literature [ 9 , 36 , 75 ,
123 ]. However, the characteristics of the methods are not studied completely, many
discretizers, even classic ones, are not mentioned, and the notation used for catego-
rization is not unified. In spite of the wealth of literature, and apart from the absence of
a complete categorization of discretizers using a unified notation, it can be observed that there are few attempts to compare them empirically. As a result, the algorithms proposed are usually compared with a subset of the complete family of discretizers