and, in most of the studies, no rigorous empirical analysis has been carried out. In [51], it was observed that the most frequently compared techniques are EqualWidth, EqualFrequency, MDLP [41], ID3 [92], ChiMerge [68], 1R [59], D2 [19], and Chi2 [76].
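To give a flavor of the simplest techniques named above, the following is a minimal sketch (not taken from the chapter) of the two classic unsupervised discretizers, EqualWidth and EqualFrequency; the function names are illustrative, not from any cited work:

```python
def equal_width(values, k):
    """EqualWidth: split the range of `values` into k intervals of equal
    width and return the k-1 interior cut points."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    return [lo + i * width for i in range(1, k)]

def equal_frequency(values, k):
    """EqualFrequency: choose k-1 interior cut points so that each of the
    k intervals holds roughly the same number of values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[i * n // k] for i in range(1, k)]
```

On a skewed sample the two yield different cut points: EqualWidth ignores the data distribution, while EqualFrequency adapts the interval boundaries to it.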
These reasons motivate the global purpose of this chapter. We can summarize it
into three main objectives:
• To provide an updated and complete taxonomy based on the main properties observed in discretization methods. The taxonomy will allow us to characterize their advantages and drawbacks in order to choose a discretizer from a theoretical point of view.
• To carry out an empirical study analyzing the most representative and newest discretizers in terms of the number of intervals obtained and the inconsistency level of the data.
• Finally, to relate the best discretizers to a set of representative DM models, using two metrics to measure predictive classification success.
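The inconsistency level mentioned in the second objective can be computed as follows; this is a hedged sketch of the usual definition (the fraction of instances not covered by the majority class of their discretized pattern), with hypothetical names, since the chapter has not yet defined the measure formally:

```python
from collections import Counter, defaultdict

def inconsistency_rate(discretized_rows, labels):
    """For each distinct discretized feature pattern, count its instances
    minus the count of its majority class (the 'conflicting' instances),
    then divide the total by the number of instances N."""
    groups = defaultdict(list)
    for row, y in zip(discretized_rows, labels):
        groups[tuple(row)].append(y)
    conflicting = sum(len(ys) - max(Counter(ys).values())
                      for ys in groups.values())
    return conflicting / len(labels)
```

A rate of 0 means every discretized pattern maps to a single class; higher values indicate that the discretization has merged instances of different classes into the same intervals.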
9.2 Perspectives and Background
Discretization is a wide field in which there have been many advances and ideas over the years. This section provides the necessary background on the topic, together with a set of related areas and future perspectives on discretization.
9.2.1 Discretization Process
Before starting, we first introduce, for the sake of unification, some terms that are used differently across sources.
9.2.1.1 Feature
Also called an attribute or variable, a feature refers to an aspect of the data and is usually associated with a column in a data table. M denotes the number of features in the data.
9.2.1.2 Instance
Also called a tuple, example, record, or data point, an instance refers to a collection of values for all features. A set of instances constitutes a data set, and instances are usually associated with rows in a data table. Following the introduction, N denotes the number of instances in the data.
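The terminology above can be made concrete with a hypothetical toy data set (the values are invented for illustration):

```python
# N instances (rows), each a collection of values for the M features (columns).
data = [
    [5.1, 3.5, 1.4],   # one instance (row)
    [4.9, 3.0, 1.4],
    [6.2, 2.9, 4.3],
]
N = len(data)      # number of instances
M = len(data[0])   # number of features
```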